1 Introduction

Cantor Digitalis is a singing instrument, i.e., a performative singing synthesis system. It allows for expressive musical control of high-quality vocal sounds. Expressive musical control is provided by an effective human-computer interface that captures the player’s gestures and converts them into synthesis control parameters [1, 2]. High-quality vocal sounds are produced by the synthesis engine, which features a specially designed formant synthesizer and an elaborate set of singing rules [3–5].

Cantor Digitalis is a musical instrument, and it is regularly played on stage by Chorus Digitalis 1, the choir of Cantor Digitalis. The expressiveness and sound quality of this innovative musical instrument have been recognized: it was awarded the first prize of the 2015 International Margaret Guthman Musical Instrument Competition (Georgia Institute of Technology)2. Cantor Digitalis is distributed as free software, accompanied by detailed documentation. However, the scientific basis and technical details underlying Cantor Digitalis have never been published or discussed. The aim of the present work is to provide a comprehensive technical description of Cantor Digitalis, including the interface, the formant synthesizer, and the singing synthesis rules.

The sound synthesis components of Cantor Digitalis are in the tradition of formant synthesis. Apart from tape-based music using recorded voices and vocoders, synthetic voices first appeared in contemporary music pieces thanks to the “Chant” program [6]. “Chant” was based on a formant voice synthesizer and synthesis by rules, i.e., a parametric model of voice production3. Other research groups also proposed rule-based formant synthesizers [7, 8]. The main advantages of parametric synthesis are its flexibility and its economy in terms of memory and computational load. The next generation of voice synthesis systems was based on recording, concatenation, and modification of real voice samples4 or statistical parametric synthesis [9]. A formant synthesizer is preferred for Cantor Digitalis because flexibility and real-time operation are the main concerns for performative singing synthesis.

Singing instruments have been proposed by different research groups [3, 10–13]. The graphic tablet has been proposed for approximately a decade [10, 14, 15] for controlling intonation and voice source variation. This interface appeared as a very effective choice. It has been extensively tested for intonation control in speech and singing synthesis [16, 17]. Additionally, this interface allows great expressiveness [18] because it takes advantage of the accuracy and precision acquired through writing/drawing gestures. This is the interface chosen for Cantor Digitalis.

This article describes the three main original components of Cantor Digitalis: the interface, the synthesis engine, and the rules for converting the input of the former into parameters for the latter. The next section presents the general architecture of Cantor Digitalis and the main issues and choices related to chironomic control of the singing voice. Section 3 presents the parametric formant synthesizer. Section 4 describes the synthesis rules, i.e., the transformation of parameters issued by the chironomic interface into synthesizer parameters. Section 5 discusses the obtained results, illustrated by audio-visual files, and proposes some directions for future work.

2 Chironomic control of the singing voice

The Cantor Digitalis architecture is illustrated in Fig. 2. It is composed of three layers: the interface, the synthesis/mapping rules, and the parametric synthesizer. This architecture follows the path of music production, from the player to sound. Initially, the musician plans to produce a given musical phrase, with a given vowel, given dynamics, given voice quality, and so forth. The planned musical task is then expressed through hand gestures related to the interface, i.e., through motions of a stylus and fingers on the graphic tablet (after selection of the voice type and other presets). The interface captures high-level parameters that are perceptually relevant to the player, such as the vowel quality or pitch. These high-level parameters are then converted into low-level synthesis parameters through a layer of synthesis/mapping rules. Low-level synthesis parameters drive the parametric voice synthesizer for sound sample production. The resulting sound is played back and heard by the musician, who reacts accordingly, which closes the perception-action loop for performative singing synthesis. Before addressing the control method itself, the control parameters must be identified.

2.1 Singing voice parameters

Cantor Digitalis is restricted to vocalic sounds (the case of consonants being considerably more difficult for real-time high-quality musical control [5, 19]). The corresponding parameters are pitch, voice force (or vocal effort), voice quality, and vowel label. The main perceived dimensions of voice quality are [20] voice tension (lax/tense voice), noise (aspiration noise in the voice resulting in breathiness and structural aperiodicities such as vocal jitter or shimmer resulting in roughness or hoarseness), and vocal tract size (or larynx height). All high-level parameters are listed below:

  • Pitch P corresponds to the perceived melodic dimension of voice sounds. It is often the most important musical dimension.

  • Vocal effort E corresponds to the dynamics, i.e., perceived force of vocal sounds. It is also an essential musical dimension.

  • Vowel height H defines the openness or closeness of the vowel and corresponds to the vertical axis in the vocalic triangle, a classical two-dimensional representation for vowels. This dimension is related to the vertical position of the tongue, which depends on the aperture of the jaw.

  • Vowel backness V defines the front-back position of the vowel and corresponds to the horizontal axis of the vocalic triangle. It is related to the position of the tongue relative to the teeth and the back of the mouth.

  • The noise dimension in vocal sounds can be decomposed into two components. The first component is roughness or hoarseness R due to structural aperiodicities, i.e., random pitch period or amplitude perturbations. It defines the hoarse or rough quality of the voice.

  • Breathiness B is the second noise dimension. Leakage at the glottis produces aspiration or breath noise in the voice. The extreme case of breathiness is whispering, with no fold vibration, resulting in unvoiced vowels.

  • Tenseness T defines the tense/lax quality of the voice, i.e., the degree of adduction/abduction of the vocal folds.

  • Vocal tract size S defines the apparent vocal tract size of the singer. Vocal tract size is singer dependent, but it also varies according to the larynx position or lips rounding for the same individual.

  • Pitch range is singer dependent. A typical singer range is approximately 2 octaves. For simplicity, a unique pitch range size of 3 octaves is implemented. To play either low (e.g., bass) or high (e.g., soprano) voices, a pitch offset parameter P 0 is introduced.

  • Laryngeal vibration mechanism M defines the vibration mode of the vocal folds used by the singer. Only chest and falsetto mechanisms are used, corresponding to M=1 and M=2, respectively.

All these dimensions are expressed in normalized units (between 0 and 1), except P 0 and M, and are summarized in Table 1. For most musical performances, only vowel label, pitch, and vocal effort are controlled with the help of chironomy. The other dimensions are controlled using a graphical user interface (GUI) on a computer. Note that other divisions of parameter control between the GUI and chironomy are possible; for instance, modulation of breathiness, voice tension, or vocal tract length can be assigned to the stylus (see Table 2 for musical examples).

Table 1 High-level parameter control
Table 2 Examples of voice types, parameter variations, and parameter dependency rules, including public performance sequences (with corresponding sounds and videos)

2.2 Chironomic control: an augmented graphic-touch tablet

Following previous experiments, a Wacom Intuos 5M-touch tablet has been chosen as the interface, allowing for bi-manual chironomic control: the tablet detects the position and pressure of the pen over the 2D plane, as well as the finger position over the surface. This interface is preferred for two main reasons. On the one hand, it is reactive, with no noticeable latency. The time resolution of the Wacom Intuos 5M tablets is 5 ms with a pen and 20 ms with a finger. This resolution proved short enough to provide the player with the feeling of direct causality between gesture and sound, similar to that for an acoustical instrument. On the other hand, this interface provides a fine spatial resolution, avoiding noticeable quantization effects in parameter variation. Wacom Intuos 5M tablets have a spatial resolution of 5080 lines per inch (0.005 mm) with a pen tip diameter of approximately 0.25 mm and 2048 levels of pressure. In addition, this interface allows for accurate, reproducible, and intuitive gestures. The pen tablet takes advantage of our writing ability, developed since childhood. The touch technology also takes advantage of the widespread habit of finger gestures on phones and computer tablets.

For increased intonation accuracy, the tablet is equipped with visual references. A printed template with pitch and vowel targets is superimposed on the active zone of the tablet. The top of Fig. 1 summarizes the interface part of Cantor Digitalis. Melodic accuracy and precision with this interface are comparable to or better than those obtained by singers [17].

Fig. 1

Bi-manual control of the vocalic color (chosen in the 2D rectangle space with the finger position of the non-preferred hand) and the pitch (controlled with the preferred hand by the pen position along the X-dimension). Middle panel: full view of the tablet chromatic pattern. Bottom panel: full view of the tablet raga-Yaman pattern

2.3 Voice source control

Melody and dynamics are the most important musical features. They are associated with the tablet’s pen handled by the preferred hand to guarantee the best possible accuracy for intonation and dynamics. Pitch P is controlled by the X-stylus position, in a left-right organization similar to a keyboard. For accurate pitch targeting, the template attached to the tablet is carefully calibrated. It can represent a keyboard, a guitar fingerboard, or any specific melodic arrangement (e.g., the notes of a given mode or a raga), depending on the musical purpose. Examples of a “keyboard” template, with black and white keys, and a template for an Indian modal scale (raga Yaman), are presented in Fig. 1.

For the flat and continuous tablet surface, a template representing the melodic scale is needed. The exact pitch of each note of the melodic scale corresponds to the center line of the printed key: whereas on a traditional keyboard the same pitch is associated with the whole key width, here the pitch varies continuously with the pen position on the tablet. The thick vertical lines correspond to the pitches of the chromatic keys, while the thin vertical lines correspond to the diatonic keys. Note that a dynamic intonation correction algorithm is available. This option can help less experienced users and is also useful for virtuoso passages [21].

Vocal effort E is the second main voice source parameter. It controls musical dynamics and does not need as much precision as pitch. The pen pressure has been chosen because of its analogy with vocal effort: a harder pressure corresponds to a higher vocal effort and a louder sound. Additionally, voice sound production occurs when the air flow from the lungs exceeds a phonation threshold. This threshold represents the sub-glottal pressure required to start vocal fold vibration. Thus, we introduced a vocal effort threshold E thr under which no voiced sound is produced. A linear mapping between the stylus tip pressure and the vocal effort parameter appeared convenient.

2.4 Control of the vocalic space

Vowel label control is assigned to the non-preferred hand. The vocalic space is represented by a two-dimensional vocalic triangle or trapezium [22]. The two axes match the opening degree of the jaw (open-close vowel axis) and the position of the tongue in the mouth (antero-posterior vowel axis), respectively. The Wacom Intuos 5M-Touch tablet allows for the use of fingers at the same time as the stylus. The two dimensions of the vocalic space H and V can be controlled by the two-dimensional positions of a finger, y and x, respectively. French vowels are represented in a specific area in the left-top corner of the tablet (see Fig. 1 and Table 3).

Table 3 Base vocalic formant center frequencies, bandwidths, and amplitudes

2.5 Control of the voice quality

Voice quality dimensions are controlled using a GUI on the computer screen. Each parameter (roughness R, tension T, breathiness B, vocal tract size S, and laryngeal vibratory mechanism M) corresponds to a slider. Because only three octaves can be represented on the tablet, the pitch range is selected among seven possibilities. The voice dimensions used in Cantor Digitalis along with their control types are presented in Table 1.

3 Parametric formant synthesizer

3.1 Formant synthesizer architecture

The sound of Cantor Digitalis is computed by a formant synthesizer [23, 24], based on the linear model of speech production [25]. The main advantages of formant synthesis are its low computational cost, allowing for real-time processing, and the parametric representation of the vocal sounds, allowing for full control of voice type and voice quality. A new parallel/series formant synthesizer has been designed, and its general architecture is shown in Fig. 2 (bottom). According to the source-filter theory of speech production, the vocal sound \(\mathcal {S}\) in the spectral domain is the product of a glottal flow derivative model \(\mathcal {G'}\) and a vocal tract model \(\mathcal {V}\). The glottal flow derivative model \(\mathcal {G'}\) is composed of two elements: periodic pulses weighted by a factor A g , filtered by the glottal formant response GF and the spectral tilt response ST, and a Gaussian white noise \(\mathcal {N}\), filtered by a bandpass filter NS and weighted by a factor A n and by the harmonic part GF×ST. The vocal tract model \(\mathcal {V}\) is the sum of resonant filter responses R i weighted by an anti-resonance filter response BQ.

$$ \begin{aligned} \mathcal{S}(f) &= \mathcal{G'}\left(f\right)\mathcal{V}\left(f\right)\\ &= \left(\sum_{n} \delta \left(f- n f_{0}\right) \mathsf{GF}\left(f\right) \mathsf{ST} \left(f\right) \right.\\ &\quad\left. + A_{n}\left[\sum\limits_{n} \delta \left(f- n f_{0}\right) \mathsf{GF}\left(f\right) \mathsf{ST} \left(f\right) \right] \otimes \left[ \mathcal{N}\left(f\right) \mathsf{NS} \left(f\right) \right]\right) \\ &\quad\times \mathsf{BQ}\left(f\right) {\sum\nolimits}_{i=1}^{5} \mathsf{R_{i}} \left(f\right) \end{aligned} $$
(1)
Fig. 2

The three main components of Cantor Digitalis: interface, rules, and synthesizer (see text for details)

Note that the glottal source derivative is used for the source. Assuming that the lip radiation component of the speech production model can be modeled as a derivation and that the source-filter model is linear, the radiation component can be included directly in the source component. Figure 2 and each term of Eq. 1 are explained in detail in the following sections.

3.2 Voice source model

A parametric model of the glottal flow derivative, equivalent to the LF model [26], is used for the voiced source. The model is described in the spectral domain, according to previous results [27]. The spectral approach is well suited to real-time implementation because of a low computational load. The perceptive parameters of voice quality are genuinely linked to spectral descriptions, such as spectral richness or harmonic amplitudes.

3.2.1 A spectral model

The first version of Cantor Digitalis used the causal anti-causal linear voice source model (CALM) [13, 28]. In the current version of Cantor Digitalis, a simpler version is used, computing only the magnitude spectrum of the glottal flow derivative (not the phase spectrum). The magnitude spectrum is a combination of a spectral peak, the glottal formant, and dynamic slope variation for high frequencies.

The model is described by five parameters: fundamental frequency f 0, glottal formant frequency F g and bandwidth B g , maximum excitation (glottal flow derivative negative peak) A g , and high frequency attenuations \(T_{l_{1}}\) and \(T_{l_{2}}\) (or spectral tilt).

3.2.2 Glottal formant

For each voicing period, the glottal flow derivative is computed with two cascaded linear filters: the glottal formant filter represents the main source-related spectral peak, and the spectral tilt filter represents high-frequency voice variation. The glottal formant is computed in time domain as the impulse response of the first linear filter GF (GF box in Fig. 2).

The transfer function of the glottal formant is computed using a 2-pole 1-zero digital resonant filter in series with a 1-zero derivation filter standing for the lip radiation component (T s being the sampling period, fixed at 1/96000 sec) [28] 5:

$$ {} \mathsf{GF}(z) = - \frac {A_{g} z^{-1} \left(1-z^{-1}\right) } {1 - 2 e^{- \pi B_{g} T_{s}} \cos(2 \pi F_{g} T_{s}) z^{-1} + e^{-2 \pi B_{g} T_{s}} z^{-2}} $$
(2)
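As an illustration of Eq. 2, the following minimal sketch derives the difference-equation coefficients of the glottal formant filter and computes a few samples of its impulse response. The function names and the example values of F g and B g are illustrative assumptions, not taken from the Cantor Digitalis source code.

```python
import math

FS = 96000.0  # sampling rate implied by T_s = 1/96000 s in the text


def glottal_formant_coeffs(a_g, f_g, b_g, fs=FS):
    """Return (b, a) coefficient lists for GF(z) as written in Eq. 2."""
    ts = 1.0 / fs
    r = math.exp(-math.pi * b_g * ts)        # pole radius
    theta = 2.0 * math.pi * f_g * ts         # pole angle
    # Numerator: -A_g z^-1 (1 - z^-1) = -A_g z^-1 + A_g z^-2
    b = [0.0, -a_g, a_g]
    # Denominator: 1 - 2 r cos(theta) z^-1 + r^2 z^-2
    a = [1.0, -2.0 * r * math.cos(theta), r * r]
    return b, a


def impulse_response(b, a, n):
    """First n samples of the filter output for a unit impulse."""
    y = []
    for k in range(n):
        acc = b[k] if k < len(b) else 0.0
        for j in range(1, len(a)):
            if k - j >= 0:
                acc -= a[j] * y[k - j]
        y.append(acc)
    return y


# Hypothetical example values: F_g = 100 Hz, B_g = 50 Hz, A_g = 1
b, a = glottal_formant_coeffs(1.0, 100.0, 50.0)
h = impulse_response(b, a, 8)
```

Because the numerator contains the factor (1 − z⁻¹), the coefficients in `b` sum to zero, which reflects the derivation (lip radiation) term placing a zero at DC.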

3.2.3 Spectral tilt

A 2-pole 2-zero low-pass filter accounts for the spectral slope in high frequencies (ST box in Fig. 2). Its transfer function reads as (derived from [28])

$$ \mathsf{ST}(z)=\mathsf{ST_{1}}(z) \times \mathsf{ST_{2}}(z) $$
(3)

where

$$ \mathsf{ST_{i}}(z)=\frac{1 - (\nu_{i} - \sqrt{{\nu_{i}^{2}}-1}) }{1-(\nu_{i} - \sqrt{{\nu_{i}^{2}}-1}) z^{-1}}, i=1,2 $$
(4)
$$ \nu_{i} = 1 - \frac{\cos(2\pi 3000 T_{s})-1}{10^{T_{l_{i}}/10}-1} $$
(5)

with \(T_{l_{i}}\) (i=1,2) corresponding to attenuation in dB at 3000 Hz.
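The relation between Eqs. 4 and 5 and the attenuation at 3000 Hz can be checked numerically. The sketch below computes the pole of one tilt stage and evaluates its gain; the function names are illustrative, and the bypass for a 0 dB setting is an assumption added here because Eq. 5 is undefined when the attenuation is exactly 0 dB.

```python
import cmath
import math

FS = 96000.0  # sampling rate implied by T_s = 1/96000 s


def tilt_pole(t_l_db, fs=FS, f_ref=3000.0):
    """Pole nu - sqrt(nu^2 - 1) of one tilt stage (Eqs. 4 and 5)."""
    if t_l_db <= 0.0:
        # Assumption: treat 0 dB as a pass-through stage, since the
        # denominator of Eq. 5 vanishes in that case.
        return 0.0
    g = 10.0 ** (t_l_db / 10.0)
    nu = 1.0 - (math.cos(2.0 * math.pi * f_ref / fs) - 1.0) / (g - 1.0)
    return nu - math.sqrt(nu * nu - 1.0)


def tilt_gain_db(t_l_db, f, fs=FS):
    """Gain in dB of ST_i(z) = (1 - a) / (1 - a z^-1) at frequency f."""
    a = tilt_pole(t_l_db, fs)
    z_inv = cmath.exp(-2j * math.pi * f / fs)
    h = (1.0 - a) / (1.0 - a * z_inv)
    return 20.0 * math.log10(abs(h))


gain_at_3k = tilt_gain_db(6.0, 3000.0)   # close to -6 dB
gain_at_dc = tilt_gain_db(6.0, 0.0)      # 0 dB: unity gain at DC
```

Each stage has unity gain at DC and an attenuation of T l i dB at 3000 Hz, so cascading the two stages of Eq. 3 attenuates the source by the sum of the two settings at that frequency.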

3.2.4 Unvoiced source component

The unvoiced source component is computed using a Gaussian white noise \(\mathcal {N}\) filtered by a wide band-pass second-order filter NS to simulate the effect of flow turbulence at the glottis. Turbulent noise sources can be modeled by a high-pass filter with a small spectral tilt in high frequencies [29]. We chose a second-order Butterworth filter with 1000 and 6000 Hz as cutoff frequencies. This noise, with amplitude A n , is then modulated by the glottal flow derivative for mixed (noisy voiced) voice source qualities and added to it.

3.3 Vocal tract model

The vocal tract in Cantor Digitalis is computed with the help of a cascade/parallel formant synthesizer. This hybrid structure allows for fine adjustment of the voice spectrum and hence fine control of the vowel quality, vocal tract size, and singer individuality.

3.3.1 Vocal tract resonances

The parallel components of the vocal tract filter are composed of six band-pass filters, each corresponding to one formant. Each formant is a 2-pole 2-zero digital resonator filter R i with transfer function (formant central frequency F i , formant bandwidth B i , and gain A i ,i∈ [ 1,6]) [30]:

$$ \mathsf{R_{i}}(z) = \frac {A_{i} \left(1- e^{-\pi B_{i} T_{s}}\right) \left(1-e^{- \pi B_{i} T_{s}} z^{-2}\right) } {1 - 2 e^{- \pi B_{i} T_{s}} \cos(2 \pi F_{i} T_{s}) z^{-1} + e^{-2 \pi B_{i} T_{s}} z^{-2}} $$
(6)
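A useful property of Eq. 6 is that the numerator normalizes the resonator so that its gain at the formant center frequency equals A i . The sketch below evaluates the frequency response directly from the transfer function; the function name and the example formant values (700 Hz center, 80 Hz bandwidth) are illustrative assumptions.

```python
import cmath
import math


def resonator_response(f, f_i, b_i, a_i, fs=96000.0):
    """Complex frequency response of R_i(z) in Eq. 6 at frequency f (Hz)."""
    ts = 1.0 / fs
    r = math.exp(-math.pi * b_i * ts)            # pole radius
    theta = 2.0 * math.pi * f_i * ts             # pole angle
    z1 = cmath.exp(-2j * math.pi * f * ts)       # z^-1 on the unit circle
    num = a_i * (1.0 - r) * (1.0 - r * z1 * z1)
    den = 1.0 - 2.0 * r * math.cos(theta) * z1 + r * r * z1 * z1
    return num / den


# Hypothetical formant: center 700 Hz, bandwidth 80 Hz, gain 1
peak = abs(resonator_response(700.0, 700.0, 80.0, 1.0))     # equals A_i
skirt = abs(resonator_response(2000.0, 700.0, 80.0, 1.0))   # far below A_i
```

Since each resonator peaks at its own gain A i , the formant amplitudes can be set independently in the parallel structure.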

The first three formants contribute to the vowel identification. The remaining three formants contribute to voice timbre. The “singing formant,” described in the analysis of lyric voices, can be produced by grouping the third, fourth, and fifth formants [31].

3.3.2 Hypo-pharynx anti-resonances

In the spectrum of natural voice, one can observe anti-resonances at approximately 2.5–3.5 kHz and 4–5 kHz for vowels. The presence of the hypo-pharynx, composed of the laryngeal cavity and the bilateral piriform sinuses in the lower part of the vocal tract, appears to be primarily responsible for creating the anti-resonances. These spectral valleys are a clue for speaker identification but do not vary much between vowels of the same person [32, 33].

In Cantor Digitalis, an anti-formant is placed in cascade after the parallel formant filters. A second-order fixed anti-resonance (BQ box in Fig. 2) is computed by a notch filter, a bi-quadratic second-order filter. Its transfer function is (quality factor Q BQ , anti-resonance frequency F BQ ) [30]

$$ \mathsf{BQ}(z)=\frac{1 + \beta_{BQ} z^{-1} + z^{-2}}{1+\alpha_{BQ} + \beta_{BQ} z^{-1} + (1-\alpha_{BQ})z^{-2}} $$
(7)

where

$$ \alpha_{BQ} = \frac{\sin(2\pi F_{BQ} T_{s})}{2 Q_{BQ}} $$
(8)
$$ \beta_{BQ} = - 2 \cos(2\pi F_{BQ} T_{s}) $$
(9)
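Equations 7 to 9 can be sketched as follows. The values of F BQ and Q BQ are not given in this section, so the 4500 Hz and 2.5 used below are purely hypothetical; the function names are likewise illustrative.

```python
import cmath
import math


def notch_coeffs(f_bq, q_bq, fs=96000.0):
    """Return (b, a) for the anti-resonance BQ(z) of Eqs. 7 to 9."""
    w = 2.0 * math.pi * f_bq / fs
    alpha = math.sin(w) / (2.0 * q_bq)   # Eq. 8
    beta = -2.0 * math.cos(w)            # Eq. 9
    b = [1.0, beta, 1.0]
    a = [1.0 + alpha, beta, 1.0 - alpha]
    return b, a


def biquad_response(b, a, f, fs=96000.0):
    """Complex frequency response of a second-order section at f (Hz)."""
    z1 = cmath.exp(-2j * math.pi * f / fs)
    num = b[0] + b[1] * z1 + b[2] * z1 * z1
    den = a[0] + a[1] * z1 + a[2] * z1 * z1
    return num / den


# Hypothetical settings: notch at 4500 Hz with Q = 2.5
b, a = notch_coeffs(4500.0, 2.5)
depth = abs(biquad_response(b, a, 4500.0))   # essentially zero at F_BQ
dc_gain = abs(biquad_response(b, a, 0.0))    # unity gain at DC
```

The numerator places a pair of zeros on the unit circle at F BQ , while Q BQ only affects the width of the valley through the denominator.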

Finally, all the parameters of the synthesizer are summarized in Table 4.

Table 4 Low-level synthesis parameters

4 Voice dimensions to parameter mapping

In this section, the mapping between voice dimensions and synthesis parameters is detailed. Recall that voice dimensions are managed by the actions of the player on the chironomic interface and the GUI. Both interfaces can be used simultaneously and in real time. The chironomic interface is preferred for fast musical actions, such as playing notes of a melody, and the GUI is preferred for slower action, such as creating one’s own voice character. The interplay between voice dimensions is rather intricate for some parameters in Cantor Digitalis. This is because as much knowledge as possible from the singing voice analysis literature has been incorporated in the mapping procedures, including formant tuning, vocal effort modeling, periodicity perturbations, voice mechanism modeling, and voice type settings.

4.1 Fundamental frequency

The fundamental frequency f 0 is mainly driven by the pitch voice dimension P defined by the stylus as well as by the pitch offset P 0 and several perturbations.

4.1.1 Pitch control

Pitch perception is very accurate: for synthetic vowels, the threshold for absolute pitch discrimination is approximately 5 to 9 cents (vowels with f 0 = 80 or 120 Hz) [34]. Considering the dimensions and resolution of the tablet, mapping approximately 3 octaves (35 semitones) of pitch to the X-axis corresponds to a pitch resolution of 0.08 cents for the smallest spatial step on the tablet (0.005 mm). Practically, the stylus tip width of approximately 0.25 mm corresponds to ±4 cents; this allows for an accuracy under the different limens for pitch perception.
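The resolution figures above can be reproduced with a short calculation. The active width of the tablet is not stated in the text; the 224 mm used below is an assumed nominal width for a medium-size tablet, and the names are illustrative.

```python
# Assumption: nominal active width of the tablet in mm (not given in the text)
ACTIVE_WIDTH_MM = 224.0
SEMITONES_ON_TABLET = 35      # pitch span mapped to the X-axis
CENTS_PER_SEMITONE = 100.0


def cents_for_distance(distance_mm, width_mm=ACTIVE_WIDTH_MM):
    """Pitch interval in cents covered by a horizontal distance in mm."""
    cents_per_mm = SEMITONES_ON_TABLET * CENTS_PER_SEMITONE / width_mm
    return cents_per_mm * distance_mm


smallest_step = cents_for_distance(0.005)   # spatial resolution of the pen
tip_width = cents_for_distance(0.25)        # approximate stylus tip width
```

Under this assumption, the smallest spatial step corresponds to roughly 0.08 cents and the stylus tip width to roughly 4 cents, consistent with the values quoted above.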

In addition to the tablet X dimension, pitch is computed according to the pitch range of a given voice because only 35 semitones (ST) are represented on the tablet. The pitch offset parameter P 0 defines the pitch of the leftmost note on the tablet in semitones. P 0=69 corresponds to a fundamental frequency of 440 Hz. The absolute pitch linked to the control P abs is (in semi-tones)

$$ P_{\text{abs}} = P_{0} +35 P $$
(10)

4.1.2 Jitter

Jitter, i.e., random perturbation of f 0, is useful for obtaining a hoarse voice quality. It is computed as a percentage of f 0 (in Hz) and is controlled by the roughness R voice dimension. In normal voices, jitter is generally less than 1%. However, in pathological voices, jitter can be as large as 5% (i.e., almost a semitone) [35]. A maximum of 30% jitter is set here to go beyond the human limit. Jitter is computed with the help of a centered random Gaussian noise generator \(\mathcal {N_{R}}\) with unity variance.

4.1.3 Long-term f 0 perturbations

In addition to additive and structural noises, slow and small amplitude random perturbations of the source contribute to a more lively quality of the sound. These perturbations are due to the heartbeat and muscular instabilities [35–37].

Of course, in the case of expert singers, minimal perturbation is expected. Nevertheless, a small amount of perturbation still remains even in the most experienced singer and may add naturalness to the synthetic voice. The perturbation due to the heartbeat can be identified in the sound pressure level and f 0 curves of natural voices. f 0 (in Hz) fluctuates up to 1% during a heartbeat cycle, and the voice amplitude fluctuates between 3 and 14%. Both depend on the mean vocal effort [38]. The additive amplitude and f 0 perturbation terms p heart are modeled on a cardiac cycle as follows (with β=0.001 ms−1 damping coefficient, A heart amplitude depending on vocal effort, and f c heartbeat frequency typically set to 1 Hz):

$$ p_{\text{heart}} = A_{\text{heart}} e^{-\beta t} \left\{ \begin{array}{ll} \cos\left(8 \pi f_{c} t - \frac{\pi}{2}\right)& \text{for}\, t\in \left[0;\frac{1}{4 f_{c}}\right]\\ \cos\left(4 \pi f_{c} t + \frac{\pi}{2}\right) & \text{for}\, t\in \left[\frac{1}{4 f_{c}};\frac{1}{ f_{c} }\right]\\ \end{array} \right. $$
(11)

When applied to f 0, A heart is set to 0.15 semitone for low vocal effort (E=E thr ), 0.01 semitone for high vocal effort (E = 1) and is logarithmically interpolated for other values of E.
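Equation 11 and the effort-dependent amplitude can be sketched as below. Three points are assumptions made for this illustration: "logarithmically interpolated" is read as a geometric interpolation between the two end values, the pattern is repeated identically on every cardiac cycle, and time is expressed in seconds (so β = 0.001 ms−1 becomes 1 s−1). The function names are illustrative.

```python
import math

E_THR = 0.2  # phonation threshold on vocal effort used in the article


def log_interp(e, at_threshold, at_max, e_thr=E_THR):
    """Geometric interpolation between the values at E = E_thr and E = 1.

    Assumption: this is one possible reading of 'logarithmically
    interpolated'; the exact rule is not given in the text.
    """
    x = min(max((e - e_thr) / (1.0 - e_thr), 0.0), 1.0)
    return at_threshold * (at_max / at_threshold) ** x


def p_heart(t, a_heart, f_c=1.0, beta=1.0):
    """Heartbeat perturbation of Eq. 11; t in seconds, beta in 1/s."""
    t = t % (1.0 / f_c)                      # assumed periodic repetition
    envelope = a_heart * math.exp(-beta * t)
    if t <= 1.0 / (4.0 * f_c):
        return envelope * math.cos(8.0 * math.pi * f_c * t - math.pi / 2.0)
    return envelope * math.cos(4.0 * math.pi * f_c * t + math.pi / 2.0)


# Amplitude (in semitones) for the f0 perturbation at a medium vocal effort
a_heart_f0 = log_interp(0.6, 0.15, 0.01)
deviation = p_heart(0.0625, a_heart_f0)      # near the first peak of the cycle
```

The two cosine branches both vanish at t = 1/(4 f c ), so the perturbation is continuous across the branch boundary.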

Other perturbations can be added. Below a variation rate of 5 Hz, they are not sufficiently slow to be considered as a controlled intonation fluctuation but are sufficiently slow to be perceived as a pitch variation [37]. We added a pink noise to the pitch (named p slow), whose amplitude is empirically limited to 0.2 semitone for low vocal effort (E=E thr) and to 0.01 semitone for high vocal effort (E=1). This noise is low-pass filtered with a cutoff frequency of 5 Hz and is independent of f 0. The filter output is reset every two cardiac cycles to avoid a deviation that is too high for a singing context.

In summary, f 0 is computed as follows (in Hz):

$$ f_{0} = 440 \cdot 2^{(P_{0} +35 P + p_{\text{heart}} + p_{\text{slow}} - 69)/12} (1 + 0.3 R \mathcal{N_{R}}) $$
(12)
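Equation 12 translates directly into code. In the sketch below, the function and argument names are illustrative, and the perturbation terms are passed in as plain numbers so that the pitch-to-frequency conversion can be checked in isolation.

```python
import random


def f0_hz(p, p0, r=0.0, p_heart=0.0, p_slow=0.0, rng=random):
    """Fundamental frequency in Hz following Eq. 12.

    p       : normalized pitch from the stylus X position (0 to 1)
    p0      : pitch offset in semitones (69 corresponds to 440 Hz)
    r       : roughness (0 to 1), scaling up to 30% jitter
    p_heart : heartbeat perturbation in semitones
    p_slow  : slow random perturbation in semitones
    """
    pitch = p0 + 35.0 * p + p_heart + p_slow
    jitter = 1.0 + 0.3 * r * rng.gauss(0.0, 1.0)
    return 440.0 * 2.0 ** ((pitch - 69.0) / 12.0) * jitter


leftmost = f0_hz(0.0, 69.0)               # 440 Hz when no perturbation is applied
octave_up = f0_hz(12.0 / 35.0, 69.0)      # one octave higher, about 880 Hz
```

A new Gaussian sample is drawn on each call, so with r > 0 successive calls return slightly different values, which is the intended jitter.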

4.2 Voice source

The voice source parameters (amplitude of noise A n , amplitude of source A g , formant center frequency F g , bandwidth B g , and spectral tilts \(T_{l_{1}}\) and \(T_{l_{2}}\)) are computed as functions of the voice parameters P, T, E, B, and R.

4.2.1 Long-term voice amplitude perturbations

Similar to f 0, long-term perturbations also affect the sound level, with a deviation of 3 to 14% in the amplitude of the voice signal [38]. However, for the perturbations to modify all the variables related to vocal effort (\(A_{n}, A_{g}, F_{g}, B_{g}, T_{l_{1}}\), and \(T_{l_{2}}\)), they are applied at the output of control parameter E and not directly on A g :

$$ E_{p} = E +p_{\text{heart}}+p_{\text{slow}} $$
(13)

The heart perturbation p heart has the same form as that for the f 0 perturbation (Eq. 11), with A heart set to 0.1 for low vocal effort (E=E thr) and 0.02 for high vocal effort (E = 1), and logarithmically interpolated for other values of E. The long-term perturbation p slow applied to E has the same mathematical expression as the one applied to f 0 (see above).

p slow is empirically limited to 0.08 for low vocal effort (E=E thr), and to 0.015 for high vocal effort (E=1) [sound examples in Additional files 1 and 2]6.

4.2.2 Glottal formant central frequency and bandwidth

The glottal formant has a major influence on the relative amplitudes of the first harmonics. Its characteristics (center frequency and bandwidth) depend on the shape of the glottal flow derivative model, given by the glottal formant frequency F g and bandwidth B g . In CALM, the latter can be defined as a function of the open quotient O q and asymmetry coefficient α m [28]:

$$ F_{g}=\frac{f_{0}}{2O_{q}} $$
(14)
$$ B_{g}=\frac{f_{0}}{ O_{q} \tan(\pi (1 - \alpha_{m}))} $$
(15)

O q and α m are expressed from the tension T and the perturbed vocal effort E p (defined in Section 4.2.1). Furthermore, to distinguish between chest and falsetto registers, respectively produced in laryngeal vibratory mechanisms M=1 and M=2, two expressions of O q and α m are given:

$$ O_{q} = \left\{ \begin{array}{ll} 10^{-2 (1 - O_{q_{0}}) T} & \text{if}\, T \leq 0.5\\ 10^{2 O_{q_{0}} (1 - T) - 1} & \text{if}\, T > 0.5\\ \end{array} \right. $$
(16)

where \(O_{q_{0}} = 0.903 - 0.426 E_{p}\) for M=1 and \(O_{q_{0}} = 0.978 - 0.279 E_{p}\) for M=2, and

$$ \alpha_{m} = \left\{ \begin{array}{ll} 0.5 + 2 (\alpha_{m_{0}} - 0.5) T & \text{if}\, T \leq 0.5\\ 0.9 - 2 (0.9 - \alpha_{m_{0}}) (1 - T) & \text{if}\, T > 0.5\\ \end{array} \right. $$
(17)

where \(\alpha _{m_{0}}=0.66\) for mechanism M=1 and \(\alpha _{m_{0}}=0.55\) for mechanism M=2. Figures 3 and 4 show the evolution of O q and α m as functions of T and E p for each laryngeal vibratory mechanism. In the computer program, α m is bounded below by 0.51 such that B g does not reach 0.

Fig. 3

Evolution of open quotient O q according to vocal effort E and tension T for the two mechanisms

Fig. 4

Evolution of asymmetry coefficient α m according to tension T for the two mechanisms

The constants are chosen such that, for T=0.5, the standard ranges for O q are [0.3,0.8] for M=1 and [0.5,0.95] for M=2 [39], and α m equals 0.66 for M=1 and 0.55 for M=2. When the tenseness T decreases to 0, O q increases to 1 and α m decreases to 0.5. When the tenseness T increases to 1, O q decreases to 0.1 and α m increases to 0.9; these extreme values are set to go beyond the human limit. The exponential function in Eq. 16 is deduced from Henrich et al. [40], who show that the perception of O q variations is proportional to O q .

From these expressions and Eq. 12, one can compute the center frequency F g and the bandwidth B g as a function of P, P 0, M, T, E, and R, with Eqs. 14 and 15. The glottal formant center frequency F g is generally situated below the first vocal tract formant F 1 and in the area of f 0. Moreover, it is proportional to f 0 and depends on tenseness and vocal effort. As a simple rule, an increase in vocal effort and/or tenseness results in an increase of the glottal formant center frequency. Note that the effect of tenseness is larger on F g (and B g ) and that it changes only the glottal formant center frequency and bandwidth, while E p also changes the spectral tilt (see below).

As O q varies between approximately 0.1 and 1, the variation of the glottal formant center frequency is between approximately F g ≃0.5f 0 for the laxest voice quality and F g ≃5f 0 for a very tense voice (see Fig. 5).

Fig. 5

Evolution of glottal flow derivative spectrum according to E and T. In green are the spectra of the glottal formant only. The glottal formant filter’s frequency response is indicated with a thick green line as the envelope of the glottal formant’s spectrum. In blue is the overall glottal spectrum including glottal formant and spectral tilt. The overall frequency response is represented with a thick blue line as the envelope of the overall glottal spectrum. Top row: low vocal effort (E=0.2); bottom row: high vocal effort (E=0.9); left column: low tension (T=0.2); right column: high tension (T=0.8). These spectra have been calculated for M=1 and f 0=100 Hz

The glottal formant bandwidth B g influences the first harmonic amplitudes relative to the higher harmonics. As α m varies between approximately 0.51 and 0.9, the variation of the glottal formant bandwidth is between approximately B g ≃0.03f 0 for the laxest voice quality and B g ≃31f 0 for a very tense voice (see Fig. 5).
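The ranges quoted for O q , α m , F g , and B g can be verified from Eqs. 14 to 17. The following sketch implements these equations as written, including the lower bound of 0.51 on α m ; the function names are illustrative.

```python
import math


def open_quotient(t, e_p, m=1):
    """Open quotient O_q from tension t and perturbed effort e_p (Eq. 16)."""
    oq0 = 0.903 - 0.426 * e_p if m == 1 else 0.978 - 0.279 * e_p
    if t <= 0.5:
        return 10.0 ** (-2.0 * (1.0 - oq0) * t)
    return 10.0 ** (2.0 * oq0 * (1.0 - t) - 1.0)


def asymmetry(t, m=1):
    """Asymmetry coefficient alpha_m from tension t (Eq. 17), floored at 0.51."""
    am0 = 0.66 if m == 1 else 0.55
    if t <= 0.5:
        am = 0.5 + 2.0 * (am0 - 0.5) * t
    else:
        am = 0.9 - 2.0 * (0.9 - am0) * (1.0 - t)
    return max(am, 0.51)


def glottal_formant(f0, t, e_p, m=1):
    """Glottal formant frequency F_g and bandwidth B_g (Eqs. 14 and 15)."""
    oq = open_quotient(t, e_p, m)
    am = asymmetry(t, m)
    f_g = f0 / (2.0 * oq)
    b_g = f0 / (oq * math.tan(math.pi * (1.0 - am)))
    return f_g, b_g


lax = glottal_formant(100.0, 0.0, 0.5)     # F_g near 0.5 f0, B_g near 0.03 f0
tense = glottal_formant(100.0, 1.0, 0.5)   # F_g near 5 f0, B_g near 31 f0
```

At T=0.5 the two branches of each piecewise definition coincide, so O q and α m vary continuously with tension.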

4.2.3 Voice spectral tilt

Pressure on the stylus controls the vocal effort, the main influence of which is to change the voice spectral tilt. The spectral tilt decreases as the pressure increases (i.e., a strong pressure corresponds to a low spectral tilt or a boost in high frequency). Spectral tilt is controlled by a series of two low-pass filters driven by two parameters \(T_{l_{1}}\) and \(T_{l_{2}}\) (see Eqs. 3 to 5). The spectral tilt parameter \(T_{l_{1}}\) varies in mechanism M=1 (resp. M=2) between 6 dB (resp. 9 dB) for a minimum tilt and maximum effort and 27 dB (resp. 45 dB) for a minimum effort and maximum tilt, whereas the spectral tilt parameter \(T_{l_{2}}\) varies in mechanism M=1 (resp. M=2) between 0 dB (resp. 1.5 dB) for a minimum tilt and maximum effort and 11 dB (resp. 20 dB) for a minimum effort and maximum tilt.

$$\begin{array}{@{}rcl@{}} T_{l_{1}} &= &\left\{ \begin{array}{ll} 27 - 21 E_{p}~\text{dB} & \text{for}\, M = 1\\ 45 - 36 E_{p}~\text{dB} & \text{for}\, M = 2 \end{array} \right. \end{array} $$
(18)
$$\begin{array}{@{}rcl@{}} T_{l_{2}} &= &\left\{ \begin{array}{ll} 11 - 11 E_{p}~\text{dB} & \text{for}\, M = 1\\ 20 - 18.5 E_{p}~\text{dB} & \text{for}\, M = 2 \end{array} \right. \end{array} $$
(19)

The sum of \(T_{l_{1}}\) and \(T_{l_{2}}\) corresponds to the attenuation (dB) of the glottal flow derivative at 3000 Hz.
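As an illustration, Eqs. 18 and 19 can be sketched in a few lines of Python. This is a minimal sketch, not the Max implementation: the function names and the argument name `effort_p` (standing for E p) are chosen here for readability.

```python
def spectral_tilt(effort_p: float, mechanism: int = 1) -> tuple[float, float]:
    """Return (T_l1, T_l2) in dB, following Eqs. 18 and 19."""
    if not 0.0 <= effort_p <= 1.0:
        raise ValueError("effort_p must lie in [0, 1]")
    if mechanism == 1:
        return 27.0 - 21.0 * effort_p, 11.0 - 11.0 * effort_p
    if mechanism == 2:
        return 45.0 - 36.0 * effort_p, 20.0 - 18.5 * effort_p
    raise ValueError("mechanism must be 1 or 2")


def attenuation_at_3khz(effort_p: float, mechanism: int = 1) -> float:
    """Sum of both tilt parameters: attenuation (dB) at 3000 Hz."""
    t_l1, t_l2 = spectral_tilt(effort_p, mechanism)
    return t_l1 + t_l2
```

With these formulas, the attenuation at 3000 Hz in mechanism M=1 spans 38 dB at minimum effort down to 6 dB at maximum effort.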

Figure 5 displays the effect of vocal effort E and tension T on the glottal flow derivative spectrum with and without spectral tilt for 4 pairs of (E,T) values corresponding to low/high tensions and low/high vocal effort [sound examples in Additional files 3 and 4]7.

4.2.4 Voicing amplitude and shimmer

Voice sound production occurs when the air flow of the lungs exceeds a phonation threshold. This threshold represents the sub-glottal pressure required to start vocal fold vibration. Thus, the sound level for a vowel cannot be arbitrarily small: there is a minimum amplitude step in vocal effort between silence and phonation, with a hysteresis effect between the start and the end of phonation. The phonation threshold is set as E=E thr=0.2 [sound examples in Additional files 5 and 6]8.

Shimmer, i.e., random perturbation of A g , is found in hoarse voice quality. A small amount of perturbations can be incorporated in voice amplitude A g computation. Shimmer is computed as a percentage of A g and controlled by the roughness R voice dimension. Although typical values are approximately 2.3% for a normal voice [41], for simulation of very rough voices, a maximum of 100% shimmer on A g is allowed in the system. Shimmer is computed with the help of a centered random Gaussian noise generator \(\mathcal {N_{R}}\) with unity variance.

One can show that changing O q , hence, changing F g and B g , in Eq. 2 has an effect on A g . Then, A g must be normalized by O q . Additionally, a correlation between E and A g is introduced. This is because in natural voice, the sound pressure level (SPL) depends on E. The chosen reference for sound level as a function of vocal effort comes from [42], extrapolated to sung voice: SPL ≃39E+60(dB), i.e., approximately 40 dB between low and high vocal efforts. The parameter C Ag =0.2 in Eq. 20 represents the signal amplitude at the phonation threshold value. It can be modified from the GUI.

Finally, A g can be computed as follows (where p hon is a binary function equal to 1 if phonation is present or 0 if there is no phonation):

$$ \begin{aligned} A_{g}= \left\{ \begin{array}{ll} 0 & \text{if}\, E_{p} \leq E_{\text{thr}} - 0.05 p_{\text{hon}} \\ \left((1- C_{Ag}) \frac{E_{p}-E_{\text{thr}}}{1-E_{\text{thr}}} + C_{Ag}\right) (1 + R \mathcal{N_{R}}) / O_{q} & \text{if}\, E_{p} > E_{\text{thr}} - 0.05 p_{\text{hon}} \end{array} \right. \end{aligned} $$
(20)

Note that although aspiration noise is modulated by A g , it is not equal to 0 for E p ≤E thr−0.05p hon. Indeed, aspiration noise is expected to be produced below the phonation threshold of vocal effort (for 0<E p <E thr−0.05p hon). Then, the aspiration noise modulation is set to \(\left ((1- C_{Ag}) \frac {E_{p}-E_{\text {thr}}}{1-E_{\text {thr}}} + C_{Ag}\right) (1 + R \mathcal {N_{R}})/ O_{q}\) regardless of E p . For simplicity, it is not mentioned in Fig. 2 or Eq. 1.
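The hysteresis in Eq. 20 may be easier to follow as code. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and the way the phonation state p hon is updated from one control frame to the next (on when the threshold is exceeded, off otherwise) is an assumption, since the text does not specify it.

```python
import random

E_THR = 0.2  # phonation threshold given in the text
C_AG = 0.2   # amplitude at the phonation threshold (editable in the GUI)


def envelope(effort_p, open_quotient, roughness=0.0, noise=0.0):
    """Non-zero branch of Eq. 20; also used as aspiration-noise modulation."""
    ramp = (1.0 - C_AG) * (effort_p - E_THR) / (1.0 - E_THR) + C_AG
    return ramp * (1.0 + roughness * noise) / open_quotient


def voicing_amplitude(effort_p, open_quotient, phonating,
                      roughness=0.0, noise=None):
    """Return (A_g, new_phonating) for one control frame (Eq. 20)."""
    if noise is None:
        noise = random.gauss(0.0, 1.0)  # centered, unit-variance Gaussian
    # The threshold is lowered by 0.05 while phonation is already present.
    threshold = E_THR - (0.05 if phonating else 0.0)
    if effort_p <= threshold:
        return 0.0, False
    # Assumption: phonation is considered present once the threshold is passed.
    return envelope(effort_p, open_quotient, roughness, noise), True
```

For example, with E p =0.18 the voice stays silent if phonation has not started yet, but keeps sounding if it was already phonating, which is the hysteresis effect described above.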

4.2.5 Noise amplitude

The breathiness dimension B directly controls the A n parameter, i.e., noise amplitude or the amount of aspiration or breath noise in the voice source. Voicing can be switched off by a voiced-unvoiced command. This allows for breathy vowels without any periodic component. The relation between B and A n is directly given by:

$$ A_{n} = \left\{ \begin{array}{ll} B & \text{if voicing is on} \\ 1.5 E_{p} B & \text{if voicing is off} \end{array} \right. $$
(21)

When voicing is on, the factor 1 is empirically set to have a maximum signal-to-noise ratio of approximately −12 dB for standard voices. When voicing is off, a dependency on E p is added to enable control over the loudness of the signal.

4.3 Vocal tract formants

4.3.1 Generic formant values

Almost all the chironomic or GUI control parameters have an effect on vocal tract formants: vowel, voice quality dimensions, pitch and vocal effort. The different voices are computed using generic formant center frequencies \(F_{i_{G}}\), −3 dB bandwidths \(B_{i_{G}}\), and amplitudes \(A_{i_{G}}\) (i∈[1,6]; G stands for “generic”), reported in Table 3 for H,V∈{0,0.5,1}. These values have been measured for a tenor voice singing at a comfortable pitch and vocal effort level. Note that formant values measured for other singers can also be used and can be easily edited using the GUI.

The ten chosen vowels (/i,y,u,e,ø,o,ɛ,œ,ɔ,a/) are sufficient for computing the entire vocalic space. The other vowels (i.e., H,V∉{0,0.5,1}) are computed using a 2-D interpolation between the four closest canonical vowels in the space formed by the vowel height H and vowel backness V dimensions. These values are defined for any H,V∈[0,1]. H and V are controlled by the finger position of the non-preferred hand on a vocalic triangle printed at the top-left corner of the graphic tablet.
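A 2-D interpolation between the four closest canonical vowels can be sketched as a bilinear interpolation on the grid H,V∈{0,0.5,1}. This is one possible reading, not necessarily the scheme used in the software: the regular 3×3 grid layout, the function name, and the table values below are assumptions. In particular, `PLACEHOLDER_TABLE` contains made-up numbers for illustration only, not the values of Table 3.

```python
import bisect

GRID = (0.0, 0.5, 1.0)  # canonical positions along H and V

# HYPOTHETICAL parameter values indexed [H index][V index]; NOT Table 3.
PLACEHOLDER_TABLE = (
    (100.0, 200.0, 300.0),
    (400.0, 500.0, 600.0),
    (700.0, 800.0, 900.0),
)


def interpolate(table, h, v):
    """Bilinear interpolation of one formant parameter at (H, V)."""
    if not (0.0 <= h <= 1.0 and 0.0 <= v <= 1.0):
        raise ValueError("H and V must lie in [0, 1]")
    # Index of the lower grid node of the cell containing (h, v).
    i = min(bisect.bisect_right(GRID, h) - 1, len(GRID) - 2)
    j = min(bisect.bisect_right(GRID, v) - 1, len(GRID) - 2)
    # Normalized position inside the cell, in [0, 1].
    th = (h - GRID[i]) / (GRID[i + 1] - GRID[i])
    tv = (v - GRID[j]) / (GRID[j + 1] - GRID[j])
    low = (1.0 - tv) * table[i][j] + tv * table[i][j + 1]
    high = (1.0 - tv) * table[i + 1][j] + tv * table[i + 1][j + 1]
    return (1.0 - th) * low + th * high
```

At a canonical position the function returns the tabulated value itself, and the result varies continuously as the finger moves across the vocalic triangle.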

4.3.2 Vocal tract length

The vocal tract length is an important factor that influences vocal identity. Male vocal tracts are on average longer than female vocal tracts because of anatomical differences. The longer the vocal tract is, the lower its formant frequencies. Voices corresponding to different vocal tract sizes are created by multiplying the formant central frequencies by the same factor. The vocal tract size parameter S is mapped to a vocal tract scale factor α S ranging from 0.5 to 2.2 with the linear equation:

$$ \alpha_{S} = 1.7 S + 0.5 $$
(22)

4.3.3 Larynx position adaptation to f 0

A modification of approximately 10% of the formant positions is noticeable between f 0 = 200 Hz and f 0 = 1000 Hz [43]. This is achieved by multiplying the central frequency of all formants by a factor K, depending on f 0, with K(f 0=200 Hz)=1 and K(f 0=1000 Hz)=1.1 (which is equivalent to modifying the length of the vocal tract):

$$\begin{array}{@{}rcl@{}} \begin{array}{ll} K = 1.25 \cdot 10^{-4} f_{0}+ 0.975\\ \end{array} \end{array} $$
(23)

Vowel height H and vowel backness V are used to find the closest vowel of the generic voice. Then, the formant center frequencies are obtained by the generic formant values \(F_{i_{G}}\) (i∈[1,6]) from Table 3, scaled by the vocal tract size scale factor α S and larynx position factor K:

$$ F_{i} = K \alpha_{S} F_{i_{G}}(V,H) \quad \text{for}\ i \in \left[1,6\right] $$
(24)
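Equations 22 to 24 chain together as shown in the following sketch. The function names are chosen for this illustration, and the generic frequencies passed in are whatever values the caller takes from Table 3. No clamping of K outside the 200 to 1000 Hz range is applied, since the text does not mention any.

```python
def tract_scale(size: float) -> float:
    """Eq. 22: map the vocal tract size parameter S in [0, 1] to alpha_S."""
    return 1.7 * size + 0.5


def larynx_factor(f0_hz: float) -> float:
    """Eq. 23: larynx position factor K as a linear function of f0."""
    return 1.25e-4 * f0_hz + 0.975


def formant_frequencies(generic_hz, size, f0_hz):
    """Eq. 24: scale the interpolated generic frequencies by K * alpha_S."""
    scale = larynx_factor(f0_hz) * tract_scale(size)
    return [scale * f for f in generic_hz]
```

Note that α S =1 (generic formants left unchanged) corresponds to S=0.5/1.7≈0.29, and that K=1 at f 0=200 Hz.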

4.3.4 First formant tuning

The main control parameter for F 1 is vowel height H. In the vocalic triangle, F 1 represents the vertical dimension. However, vowel backness V also has a slight influence. In addition to Eq. 24, the first formant center frequency depends on other parameters: f 0 and vocal effort E.

In speech, increased vocal effort results in a higher first formant frequency F 1. F 1 increases at a rate of approximately 3.5 Hz/dB on average for French oral and isolated vowels [44]. Extrapolating this result for singing voice, a rule for automatic F 1 and vocal effort dependency is implemented. For our generic tenor voice at f 0=200 Hz, the sound level varies by approximately 40 dB between E=1 (maximum value) and E=E thr (phonation threshold). The generic \(F_{1_{G}}\) for this voice corresponds to a medium vocal effort \( \left (E = \frac {1-E_{\text {thr}}}{2} \right)\). Then, for a 3.5 Hz/dB increase, the dependency rule between F 1 and E must satisfy \(F_{1} \left (E=\frac {1-E_{\text {thr}}}{2} \right) = K \alpha _{S} F_{1_{G}}\), and F 1(E=1)−F 1(E=E thr)=40×3.5 Hz. This corresponds to the term “\(\frac {140}{1-E_{\text {thr}}}E - 70\) Hz” in Eq. 25. Note that as in natural voices, the vowel identity tends to disappear for high pitch, with all vowels becoming close to each other [sound example in Additional files 7 and 8]9.

Singers can adapt their first two vocalic formants as a function of f 0 and its harmonics to exploit the vocal tract resonances as much as possible. The effect is to increase the sound intensity [43, 45]. Sopranos singing /A,o,u,ε/ vowels with a low vocal effort tend to adjust their first formant to the first harmonic f 0 (see endnote 10). The first formant is tuned to the first source harmonic F 1=f 0+50 Hz above a pitch threshold. Of course, for very high f 0, formant tuning is no longer possible because the fundamental frequency is well above the possible first formant frequency. In summary, F 1 is computed according to the following equation:

$$ \begin{aligned} {} F_{1} = \max \left(f_{0}+50~\text{Hz}, K \alpha_{S} F_{1_{G}}(V,H) + \frac{140}{1-E_{\text{thr}}} E - 70~\text{Hz} \right) \end{aligned} $$
(25)
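The first formant rule can be sketched as follows, using the effort term quoted in the text above, \(\frac {140}{1-E_{\text {thr}}}E - 70\) Hz. The function name and the argument `scaled_f1_hz`, which stands for the already scaled value \(K \alpha _{S} F_{1_{G}}(V,H)\), are naming choices made for this illustration.

```python
E_THR = 0.2  # phonation threshold given in the text


def first_formant(f0_hz: float, scaled_f1_hz: float, effort: float) -> float:
    """F1 rule: effort-dependent shift, then tuning to f0 + 50 Hz."""
    # 140 Hz spread over the effort range, centered on the medium effort.
    effort_shift_hz = 140.0 / (1.0 - E_THR) * effort - 70.0
    return max(f0_hz + 50.0, scaled_f1_hz + effort_shift_hz)
```

With E thr=0.2, the shift vanishes at E=0.4, spans 140 Hz between E=E thr and E=1, and the tuning branch takes over as soon as f 0+50 Hz exceeds the effort-corrected formant.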

4.3.5 Second formant tuning

The main control parameter for F 2 is vowel backness V, which is the horizontal dimension in the vocalic triangle. The vocal tract length factor α S modifies the formant frequency proportionally.

For high-pitched voices, 2f 0 and F 2 can come close together. In this case, there is some evidence of vocal tract resonance tuning as a function of f 0. Sopranos singing /A,o,u,ε/ vowels with a low vocal effort tend to adjust their second formant to the second harmonic 2f 0 [10]. The second formant is tuned to the second source harmonic F 2=2f 0+50 Hz above a pitch threshold. For very high f 0, formant tuning is no longer possible because the second harmonic is well above the second formant frequency. In summary, the second formant center frequency F 2 is computed as a function of vowel backness V, vowel height H, vocal tract scale factors α S and K, and f 0 [sound examples in Additional files 9 and 10]11:

$$ F_{2} = \max \left(2\,f_{0}+50~\text{Hz}, K \alpha_{S} F_{2_{G}}(V,H) \right) $$
(26)
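Because K itself depends on f 0 (Eq. 23), the pitch threshold above which Eq. 26 switches to the tuned branch can be written in closed form. The sketch below is a derivation made for illustration, not a formula given by the authors: setting 2f 0+50 equal to (a f 0+b)α S F 2G with a=1.25·10⁻⁴ and b=0.975 gives f 0=(b α S F 2G −50)/(2−a α S F 2G). The function names and the argument `scaled_generic_f2_hz` (standing for α S F 2G) are hypothetical.

```python
A_K, B_K = 1.25e-4, 0.975  # slope and offset of K in Eq. 23


def larynx_factor(f0_hz: float) -> float:
    """Eq. 23: larynx position factor K."""
    return A_K * f0_hz + B_K


def second_formant(f0_hz: float, scaled_generic_f2_hz: float) -> float:
    """Eq. 26, with scaled_generic_f2_hz = alpha_S * F2_G(V, H)."""
    return max(2.0 * f0_hz + 50.0, larynx_factor(f0_hz) * scaled_generic_f2_hz)


def f2_tuning_threshold(scaled_generic_f2_hz: float) -> float:
    """f0 at which both arguments of the max in Eq. 26 are equal."""
    denominator = 2.0 - A_K * scaled_generic_f2_hz
    if denominator <= 0.0:
        raise ValueError("no crossing for this formant value")
    return (B_K * scaled_generic_f2_hz - 50.0) / denominator
```

For a scaled generic second formant of 1500 Hz, for instance, tuning starts at about f 0≈779 Hz.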

4.3.6 Formant bandwidths

Formant bandwidths for any vowel are obtained from generic values \(B_{i_{G}}\) (given in Table 3 for canonical vowels), interpolated using vowel height H and vowel backness V.

4.3.7 Formant amplitudes

As for center frequencies and bandwidths, formant amplitudes A i (i∈[1,6]) are obtained by interpolation of the values in Table 3 using vowel height H and vowel backness V.

These values must be corrected depending on f 0. In parallel formant synthesis, the coincidence of f 0 or one of its harmonics with a formant center frequency is likely to produce artifacts: a sharp resonant filter with a narrow bandwidth amplifies a source harmonic too much when a multiple of the fundamental frequency f 0 matches the formant frequency F i (i∈[1,6]). In natural voice, this effect is occasionally sought (e.g., in diphonic singing). To correct possible outstanding harmonics, the first three resonant filter amplitudes A i (i∈[1,3]) are decreased automatically and progressively as the closest harmonic (k+1)f 0 (k∈[0,7]) approaches the central frequency F i of resonant filter i [sound examples in Additional files 11 and 12]12:

$$ A_{i} = \left\{ \begin{array}{ll} A_{i_{G}} - \left(1 - \frac{|(k+1)\, f_{0}-F_{i}|}{\Delta F_{i}} \right) \text{Att}_{\text{max}_{i}} & \text{if}\, |(k+1)\, f_{0}-F_{i}| < \Delta F_{i} \\ A_{i_{G}} & \text{otherwise} \end{array} \right. $$
(27)

Δ F i is the frequency interval around the formant central frequency where the attenuation is applied, and it is a linear function of f 0. Its values typically range from 15 to 100 Hz for f 0 from 50 to 1500 Hz. \(\text {Att}_{\text {max}_{i}}\) is the attenuation amplitude at the formant central frequency F i and is a linear function of f 0. Its values typically range from 10 to 25 dB for f 0 from 50 to 1500 Hz. All these values have been set empirically. For higher order harmonics, no correction is needed because artifacts are not perceived.
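The attenuation rule of Eq. 27 can be sketched as below. Two assumptions are made here and should be checked against the software: the linear functions of f 0 are taken to pass exactly through the endpoints of the typical ranges quoted above (15 to 100 Hz and 10 to 25 dB for f 0 from 50 to 1500 Hz), and formant amplitudes are expressed in dB. Function names are hypothetical.

```python
def linear_in_f0(f0_hz, low, high, f0_min=50.0, f0_max=1500.0):
    """Linear map from f0 in [f0_min, f0_max] to [low, high] (assumed endpoints)."""
    t = (f0_hz - f0_min) / (f0_max - f0_min)
    return low + t * (high - low)


def corrected_amplitude(generic_db, formant_hz, f0_hz):
    """Eq. 27: attenuate a formant when a harmonic comes close to it."""
    delta_f = linear_in_f0(f0_hz, 15.0, 100.0)  # width of the zone, in Hz
    att_max = linear_in_f0(f0_hz, 10.0, 25.0)   # maximum attenuation, in dB
    # Distance to the closest of the harmonics (k + 1) * f0, k in [0, 7].
    distance = min(abs((k + 1) * f0_hz - formant_hz) for k in range(8))
    if distance < delta_f:
        return generic_db - (1.0 - distance / delta_f) * att_max
    return generic_db
```

The attenuation is maximal when a harmonic falls exactly on the formant and decreases linearly to zero at a distance of Δ F i , so the correction is applied progressively rather than as a step.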

4.3.8 Anti-formants

For the generic voice, the piriform sinus anti-resonance uses a quality factor of 2.5 and a central frequency of 4700 Hz. The piriform sinus shape appears to be person dependent [32]. As the vocal tract size is likely to change the piriform sinus size, the central frequency of the piriform sinus anti-resonance is also multiplied by the vocal tract size scale factor α S .

5 Results and discussion

In this section, the evaluation of Cantor Digitalis is presented. Following objective evaluation for melodic accuracy and precision, sound quality and musical use are demonstrated with the help of didactic videos, live performance videos, and audio demonstrations for typical voices built with the synthesizer. Applications and the software distribution are presented before the conclusions and perspectives.

5.1 Evaluation of melodic accuracy and precision

Assessment of melodic precision and accuracy in singing using Cantor Digitalis compared to natural singing has recently been reported in a companion paper [17]. The reader is referred to this publication for details on the evaluation; only the main results are summarized here.

Melodic accuracy and precision were measured for a group of 20 subjects using a methodology developed for singing assessment [46]. The task of the subjects was to sing ascending and descending intervals and short melodies as well as possible. Three singing conditions were tested: chironomy (Cantor Digitalis), mute chironomy (Cantor Digitalis, but without audio feedback), and singing (i.e., the subjects’ own natural voice). The mute chironomy condition was used for studying the role played by the different (audio, visual and motor) modalities involved when playing Cantor Digitalis.

All the subjects showed comparable proficiency in natural and Cantor Digitalis singing, with some performing significantly better in chironomic singing. Note that for a majority of the subjects, this test was the first contact with Cantor Digitalis. Thus, trained players are likely to obtain even better results. However, professionally trained singers would most likely also outperform chironomic singers.

Surprisingly, for chironomic conditions, the subjects performed equally well with or without audio feedback: the two conditions do not differ significantly. This result was further investigated in a complementary study [47], showing a generally high visuo-motor ability among subjects and the dominance of vision over audition in reaching visual and audio targets: the subjects rely considerably on visuo-motor skills for playing Cantor Digitalis. This situation is somewhat similar to keyboard playing, where the musician can play with a comparable precision on a mute keyboard.

Note that in the current version of the software, an intonation correction algorithm is also available [21].

5.2 Playing with Cantor Digitalis

Using the 2D tablet surface is preferred for expressive melodic control (in principle, only a 1D parameter). An example of an X-Y trace in time for a simple melody is presented in Fig. 6. Pitch vibrato corresponds to the circles around the notes, while pitch transitions correspond to the larger curves linking the notes. An example of virtuoso melodic gestures is provided in an additional video file [see Additional file 13].

Fig. 6

Trace on the tablet of the melody CEGD played with vibrato. The red arrows indicate time

Gestures for vocal efforts and voice quality variations are also intuitively produced by the player. A video example shows two musical sentences with low and high vocal efforts [see Additional file 14]. Gestures for playing vowels and semi-vowels are shown in a third video example [see Additional file 15]. Spectrograms of vocalic variations for whispered speech are shown in Fig. 7. It is also possible to play with the GUI in real time. An example of changing vocal tract size is provided in an additional sound file [see Additional file 16].

Fig. 7

Spectrograms of different voice types, with pitch (blue thin line). Top: bass voice [sound example in Additional file 20]. Second: Bulgarian soprano voice [sound example in Additional file 26]. Third: whispered voice [sound example in Additional file 28]. Bottom: bell-like vocal impulses [sound example in Additional file 34]

As a parametric synthesizer, Cantor Digitalis is not limited to a specific voice. On the contrary, all voice types or other sounds close to the vocal model can be designed. The vocal individuality of a singer results in a specific combination of formants, pitch range, and voice qualities.

The base formant values, measured for a tenor voice, are extrapolated to produce voices with a different mean vocal tract size. A factor smaller than one increases the vocal tract size, such as for a bass singer, whereas a factor greater than one decreases the vocal tract size, such as for soprano or female alto voices. Baby voices are built with a very short vocal tract size, whereas giant voices are built with a very large vocal tract.

Voice source parameters must also be adjusted to create different voices: laryngeal vibratory mechanism, vocal tension, hoarseness, and breathiness. This is demonstrated in additional sound files with dynamic parameter modification on an ascending and descending pitch scale [sound examples in Additional files 17, 18, and 19].

Cantor Digitalis offers voice presets for different vocal types, such as the western classical vocal quartet (bass, tenor, alto and soprano [sound examples in Additional files 20, 21, 22, 23, 24, and 25]) or folk Bulgarian soprano [see Additional file 26]. “Baby” (short vocal tract, very high pitch [sound examples in Additional files 27 and 28]) or “giant” (long vocal tract, low pitch, [sound examples in Additional files 29 and 30]) voices are obtained by pushing some parameters beyond their natural boundaries. Vocal sounds can be turned into wind-like (tense voiced vowels [sound example in Additional file 31] or unvoiced vowels [sound example in Additional files 32 and 33]) or bell-like (very low pitch, vocal tract impulse responses [sound example in Additional file 34]) sounds. All the parameters can be varied independently to build a new voice type. Other formant parameters can also be used for the generic voice.

The parameter values of the voices used for the sound and video examples are presented in Table 2. Figure 7 presents spectrograms of lyric bass voice, Bulgarian soprano voice, whispered voice, and bell-like vocal impulses.

5.3 Chorus Digitalis and voice factory

The effectiveness of Cantor Digitalis as a musical instrument has been demonstrated during several successful concerts by the Chorus Digitalis 13, a choir of Cantor Digitalis. Each musician plays one Cantor Digitalis on her/his laptop with a dedicated loudspeaker for each voice, located just behind each player. Concert video excerpts are associated with this paper (North Indian vocal style [see Additional files 35 and 36]; Opera vocal style [see Additional file 37]; modern vocal style [see Additional file 38]; Bulgarian vocal style [see Additional file 39]; and performing laughs [see Additional file 40]).

Another application of Cantor Digitalis is the Voice Factory software [4], included in Cantor Digitalis. Thanks to this educational tool, the main concepts in the field of voice production can be manipulated and heard in real time. Voice source parameters, formants, and source/filter dependencies can be listened to separately or in combination. However, the most important feature is dynamic control through user gestures during the construction and deconstruction of the voice model, providing an interactive and instructive audio-visual tool. This tool has been used in various contexts: science festivals, classes in several universities, and elementary schools.

5.4 Software implementation and distribution

Cantor Digitalis is implemented in Max14 15. It is distributed under an open-source CeCILL license (a GPL-like license designed by CNRS). Interested readers are able to find all the details of the implementation directly in the software documentation and patches at the following addresses: http://cantordigitalis.limsi.fr or https://github.com/CantorDigitalis/.

The source code is provided for Max 6. It consists of a main Max patch calling Max abstractions. The main Max patch follows the source-filter structure, with several sub-patches addressing the rules and parameter mappings. Table 5 presents the list of sub-patches and their references to the corresponding sections of this article. External open-source code is used, particularly the s2m.wacom and s2m.wacomtouch Max objects16 (CeCILL license), which receive the tablet data in Max.

Table 5 List of Max patches, with short description and reference to the corresponding sections in the text

Although the continuous surface appears well suited to Cantor Digitalis, it is possible to plug in any MIDI interface. MIDI piano keyboards with pedal and wheel controls have been tested; they make it possible to play music with fast phrases, which is more difficult with the pen tablet.

The robustness of the software implementation has been assessed in practice through a large number of downloads to date. The code has already been ported by developers outside our research group to other musical interfaces, such as the Haken Continuum17 [48], the Madrona Labs Soundplane18, and the ROLI Seaboard [49]19.

6 Conclusions

Cantor Digitalis is a successful chironomic parametric singing synthesis system. This article aims at presenting the scientific and technical design of this system. As described in the present article, Cantor Digitalis is limited to vocalic synthesis. However, consonant synthesis by rules in the same framework has also been developed [5, 19]. A bi-tablet version of Cantor Digitalis, the Digitartic system, has been demonstrated, with a limited number of consonants (French consonants except /R,l/). Adding consonants on a single tablet proved difficult because too many parameters have to be controlled by the player (pitch, vocalic space, place and manner of articulation, articulation phase, and intensity on attacks and vowels). As the resulting sound quality is generally inferior to that of vowels, one can consider that the question of consonant articulation for future real-time singing instruments is still open.

Another important question is the automatic learning of specific voices. Statistical parametric learning, such as in modern text-to-speech technology, or other machine learning techniques could be used for incorporating specific voice characters with Cantor Digitalis.

7 Endnotes

1 http://cantordigitalis.limsi.fr/chorusdigitalis_en.php

2 http://guthman.gatech.edu/pastcompetitions

3 A performative version of “Chant” had been proposed very early [50].

4 Along this line, the Vocaloid system [51] witnessed phenomenal popular success.

5 The correct b1 coefficient expression is given in [13].

6 Additional files 1 and 2 are audio examples without and with the perturbations, for a medium vocal effort E=0.5.

7 Additional audio files 3 (laryngeal mechanism M=1) and 4 (laryngeal mechanism M=2) illustrate the audio effects of Eqs. 16, 17, 18, and 19, with vocal effort E=0.8, tension T=0.5, and fundamental frequency f 0=280 Hz, Alto voice type.

8 Additional audio files 5 and 6 are examples of crescendi and decrescendi, without and with phonation threshold. Vocal effort increases from 0 to 0.4 and then decreases from 0.4 to 0, with breathy voice (B=0.5). Note that breath noise remains for 0<E<E thr .

9 Additional audio files 7 and 8 are crescendi without and with formant tuning. Parameter E increases linearly from 0 to 1.

10 This effect was already in the CHANT program [6].

11 Additional audio files 9 and 10 are without and with formant tuning (Eqs. 25 and 26) and larynx position adaptation (Eq. 24).

12 Additional audio files 11 and 12 are glissandi without and with formant amplitude attenuation. Two whistling resonances are attenuated with the rule (beginning and middle of the sound).

13 http://cantordigitalis.limsi.fr/chorusdigitalis_en.php

14 http://cycling74.com/products/max/

15 Max works under OS X and Windows, and s2m.wacom works with Max only on OS X 10.6 or later. Consequently, Cantor Digitalis can be used with all its features on Mac OS X and Windows, except for the graphic tablet control under Windows. On Windows, the currently possible controls are the following: a MIDI interface such as a piano keyboard with wheels and pedal; mouse and computer keyboard; and any other control from Max messages. Max standalones compiled for Mac OS X and Windows are also provided.

16 http://metason.cnrs-mrs.fr/Resultats/MaxMSP/index.html

17 See https://youtu.be/R2XRfhu95Dc

18 See https://youtu.be/oVQMHX4bQuo

19 See https://youtu.be/mC4pmokMwRo