Sigma-Lognormal Modeling of Speech

Human movement studies and analyses have been fundamental in many scientific domains, ranging from neuroscience to education, pattern recognition to robotics, health care to sports, and beyond. Previous speech motor models were proposed to understand how speech movement is produced and how the resulting speech varies when some parameters are changed. However, the inverse approach, in which the muscular response parameters and the subject’s age are derived from real continuous speech, is not possible with such models. Instead, in the handwriting field, the kinematic theory of rapid human movements and its associated Sigma-lognormal model have been applied successfully to obtain the muscular response parameters. This work presents a speech kinematics-based model that can be used to study, analyze, and reconstruct complex speech kinematics in a simplified manner. A method based on the kinematic theory of rapid human movements and its associated Sigma-lognormal model are applied to describe and to parameterize the asymptotic impulse response of the neuromuscular networks involved in speech as a response to a neuromotor command. The method used to carry out transformations from formants to a movement observation is also presented. Experiments carried out with the (English) VTR-TIMIT database and the (German) Saarbrucken Voice Database, including people of different ages, with and without laryngeal pathologies, corroborate the link between the extracted parameters and aging, on the one hand, and the proportion between the first and second formants required in applying the kinematic theory of rapid human movements, on the other. The results should drive innovative developments in the modeling and understanding of speech kinematics.


Introduction
For decades, human movement studies and analyses have been fundamental in many scientific domains, ranging from neuroscience to education, pattern recognition to robotics, health care to sports, and beyond. The primary goal of these studies has always been to parameterize and assess human movements, providing information on the basic processes involved in fine motor control and their variability. In speech, computational systems that synthesize and assess speech motor control provide answers to some questions regarding the articulator movements used by humans to produce speech sounds, speech rate effects, or, for example, how infants acquire the motor skills needed to produce the speech sounds of their native language [1]. However, many questions regarding the modeling and automatic assessment of natural neuromotor decline in healthy speech or the parameterization of neuromotor commands and muscular responses from fast continuous speech are still open.
Many neurocomputational models have been proposed to understand how speech movement is produced and how the resulting speech varies when some parameters are changed [2]. To this end, motor control models inspired by computer programs have often been used. Under this paradigm, motor commands are generated based on a central motor plan [2] and executed by a speech generator. Several speech motor models, such as GEPETTO [3,4], ACT [5], DIVA [6], Task Dynamics [7], State Feedback [8], and FACTS [9], have been developed in this context in recent years. These models start from an action plan (planner) and then adjust a set of parameters, moving a set of articulators, to get the ideal output (feedforward models). In some of them, the acoustic output signal is then compared with the reference input signal from the planner to generate an error signal that allows the movement to be corrected (feedback models).
Previous works have been oriented toward the modeling of learned speech of healthy speakers. Some of the above models (DIVA, FACTS, and ACT) can, following adjustments of some parameters, model certain aspects of development and aging. However, to the best of our knowledge, neuromotor decline has never been modeled by such systems [2]. Moreover, the inverse approach, in which the muscular response parameters are derived from real continuous speech, is not possible with such models.
Among the models that study human movement production in general [10], the kinematic theory of rapid human movements and its associated Sigma-lognormal model [11][12][13] have been applied successfully in several fields [14] to model numerous human movements such as handwriting and signatures, as well as eye, finger, wrist, hand, head, and trunk movements [14][15][16][17]. The model has also been used to evaluate the effect of exercise on global neuromotor control [18], to detect and monitor neuromuscular disorders, and to study and synthesize handwriting motor control changes in humans with age [19][20][21]. The Sigma-lognormal model has thus demonstrated its capacity to obtain muscular response and neuromotor command parameters from online handwriting, to assess neuromotor aging, and to synthesize new handwriting samples.
In handwriting, the Sigma-lognormal model decomposes a complex movement, obtained from the temporal trajectory captured with a digital tablet, into a sum of simple time-overlapped primitives with a lognormal velocity profile. This method provides information about how every single movement is generated and synchronized, modeling the end effector (set of muscles involved in the movement) as a black box. Thus, the lognormal-shaped impulse response of the end effector, used as a primitive, is not linked to any specific articulation, but rather, to a large number of coupled subsystems. Moreover, the movement primitive is not necessarily confined to movements with a single velocity peak, as is still often assumed in many models [22].
Given the above advantages, in this paper, we propose a novel methodology based on the Sigma-lognormal model to parameterize the speech kinematics and the muscular response produced by the complex set of muscles involved in achieving the target sound, as well as to study aging effects. One question that does arise, though, is how the kinematic theory can be applied to speech modeling. The answer to this question is by no means straightforward. As a first proof of concept, preliminary works directly applied the kinematic theory of rapid human movements to diphthongs and sustained vowels uttered in neuromotor disease analysis [23][24][25][26], suggesting the possibility of applying the Sigma-lognormal model to speech. However, obtaining a general model would require a representation of a target map, a trajectory mapping, and a velocity representation, all assuming a lognormal impulse response that would need to be related to some speech features. To address these issues, we assume a high-level goal as the target map (a map of sounds that can be discriminated from one another), inspired by the work on the spatial model proposed by Moser et al. [27,28], instead of a fixed desired position of each individual speech articulator. As such, the velocity representation can be obtained from the sound transitions (trajectory map). The model is explained in detail in "Sigma-Lognormal Parameterization Method".
To test the validity of the proposed method, we present two sets of experiments. The first one aims to illustrate the meaning of the lognormal decomposition in simple movements in a continuous speech signal. In the second one, the goal is to evaluate the model's ability to identify significant differences in some parameters when modeling aging in the speech of subjects with or without laryngeal pathologies. In certain studies related to handwriting [19], it has been observed that the time between lognormals and their number increase with age. Timing effects have also been reported in speech, where an fMRI study suggested that the motor control of timing during speech production declines with age [29]. So, if the proposed Sigma-lognormal model describes the speech kinematics well, then we should expect results obtained in speech to be similar to those obtained in handwriting [19] if proper experiments are run. Moreover, since laryngeal dysfunction only affects the sound source (glottis), and not the global end effector movements, the time between lognormals should not be affected in this case, unlike in the case of aging. In the experiment section, these hypotheses will also be tested.
The present work is structured as follows. After an overview of the kinematic theory of rapid human movements in "Overview of the Sigma-Lognormal Model", "Sigma-Lognormal Parameterization Method" describes the method for estimating speech kinematics and how it is parameterized. "Evaluation, Results, and Discussion" evaluates the model and discusses the results obtained. Finally, we summarize our findings in "Conclusions".

Overview of the Sigma-Lognormal Model
The Sigma-lognormal model explains how an action plan comprising a sequence of circular arcs between virtual target points (VTP) can be activated to generate a spatiotemporal trajectory. Virtual target points are defined as the positions targeted by a lognormal, but that are not necessarily reached because of the temporal overlapping of the next lognormal [30]. Virtual targets are thus related to the learning process and how the movement is programmed by the brain. A starting and an ending angle define each arc linking virtual target points. Each ending VTP is the starting VTP of the next arc. To generate smooth movements from this discontinuous action plan, the instantiation of a command at a given VTP must start before the previous stroke reaches that VTP. In other words, each arc has a starting time but finishes later than the starting time of the next one. Therefore, successive resulting strokes are temporally overlapped. Each arc is executed following a lognormal-shaped velocity curve, and the whole trajectory is made up of the vector summation of the individual strokes.
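Mathematically, the lognormal velocity profile of a simple movement is defined by [7]

$$ v_j(t) = \frac{D_j}{\sigma_j \sqrt{2\pi}\,(t - t_{0j})} \exp\left( -\frac{\left[\ln(t - t_{0j}) - \mu_j\right]^2}{2\sigma_j^2} \right) \tag{1} $$

where $D_j$ is the length of the movement, $t_{0j}$ is the time occurrence of the movement command, $\mu_j$ is the log time delay, $\sigma_j$ is the log response time, and $j$ indicates the index of the movement. The velocity profile of a complex movement $\vec{v}_r(t)$ is given by the time superposition of NbLog lognormals [9] as follows:

$$ \vec{v}_r(t) = \sum_{j=1}^{NbLog} \vec{v}_j(t) = \sum_{j=1}^{NbLog} \begin{bmatrix} \cos\phi_j(t) \\ \sin\phi_j(t) \end{bmatrix} v_j(t) \tag{2} $$

where $\phi_j(t)$ is the angular position, defined as

$$ \phi_j(t) = \Theta_{sj} + \frac{\Theta_{ej} - \Theta_{sj}}{2} \left[ 1 + \operatorname{erf}\left( \frac{\ln(t - t_{0j}) - \mu_j}{\sigma_j \sqrt{2}} \right) \right] \tag{3} $$

and $\Theta_{sj}$ and $\Theta_{ej}$ are the starting and the end angular directions of the $j$th simple movement or stroke, and erf is the error function. Finally, the trajectory is worked out as

$$ \vec{s}_r(t) = \sum_{j=1}^{NbLog} \frac{D_j}{\Theta_{ej} - \Theta_{sj}} \begin{bmatrix} \sin\phi_j(t) - \sin\Theta_{sj} \\ -\cos\phi_j(t) + \cos\Theta_{sj} \end{bmatrix} \tag{4} $$

This expression converts angles into arcs of circumferences that are temporally overlapped. Specifically, the $j$th term of the summation represents the arc that links consecutive virtual target points, $\mathrm{VTP}_{j-1}$ and $\mathrm{VTP}_j$, which are defined by

$$ \mathrm{VTP}_j = \mathrm{VTP}_{j-1} + \int_0^T \vec{v}_j(\tau)\, d\tau \tag{5} $$

with $T$ being the total temporal duration of the spatiotemporal sequence.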
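As a rough illustration of how the lognormal primitive and the vector superposition fit together, the following sketch evaluates one stroke's speed, its time-varying direction, and the summed velocity of several overlapping strokes. The function names, the dictionary layout, and every numeric parameter value are illustrative assumptions, not values taken from this work or from ScriptStudio or iDeLog.

```python
import math

def lognormal_speed(t, D, t0, mu, sigma):
    """Speed of one stroke at time t (zero until the command time t0)."""
    if t <= t0:
        return 0.0
    z = (math.log(t - t0) - mu) / sigma
    return D / (sigma * math.sqrt(2.0 * math.pi) * (t - t0)) * math.exp(-0.5 * z * z)

def stroke_angle(t, t0, mu, sigma, theta_s, theta_e):
    """Direction of the stroke at time t, sweeping from theta_s to theta_e."""
    if t <= t0:
        return theta_s
    z = (math.log(t - t0) - mu) / (sigma * math.sqrt(2.0))
    return theta_s + 0.5 * (theta_e - theta_s) * (1.0 + math.erf(z))

def sigma_lognormal_velocity(t, strokes):
    """Vector sum of all stroke velocities at time t, returned as (vx, vy)."""
    vx = vy = 0.0
    for s in strokes:
        speed = lognormal_speed(t, s["D"], s["t0"], s["mu"], s["sigma"])
        phi = stroke_angle(t, s["t0"], s["mu"], s["sigma"], s["theta_s"], s["theta_e"])
        vx += speed * math.cos(phi)
        vy += speed * math.sin(phi)
    return vx, vy

# Two temporally overlapping strokes with made-up parameters.
strokes = [
    {"D": 10.0, "t0": 0.05, "mu": -1.5, "sigma": 0.3, "theta_s": 0.0, "theta_e": 0.5},
    {"D": 6.0, "t0": 0.20, "mu": -1.6, "sigma": 0.25, "theta_s": 1.0, "theta_e": 1.2},
]
vx, vy = sigma_lognormal_velocity(0.3, strokes)
print(f"velocity at t = 0.3 s: ({vx:.2f}, {vy:.2f})")
```

Integrating `lognormal_speed` over time returns approximately `D`, which is one quick way to check that a stroke's parameters describe a movement of the intended length.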
A sequence of virtual target points, along with their starting and ending angles and their lognormal velocity parameters, can be analytically extracted through reverse engineering (Fig. 1). Using the extracted action plan, the corresponding spatiotemporal sequence can be reconstructed from its set of parameters:

$$ P = \left\{ D_j, t_{0j}, \mu_j, \sigma_j, \Theta_{ej}, \Theta_{sj}, \mathrm{VTP}_{j-1} \right\}_{j=1}^{NbLog} \tag{6} $$

Classically, these parameters are calculated from the sampled 2D spatiotemporal sequence with software such as ScriptStudio [31] or iDeLog [32].
Once the original velocity $\vec{v}_o(t)$ has been reconstructed as a summation of lognormals ($\vec{v}_r(t)$), the quality of the reconstruction can be evaluated using the signal-to-noise ratio (SNR) between them. Specifically, the SNR is defined as [30]

$$ \mathrm{SNR} = 10 \log_{10} \left( \frac{\int_0^T \left| \vec{v}_o(t) \right|^2 dt}{\int_0^T \left| \vec{v}_o(t) - \vec{v}_r(t) \right|^2 dt} \right) \tag{7} $$

It is commonly accepted that when SNR < 15 dB, the reconstruction is not appropriate, either because ScriptStudio [31] or iDeLog [32] did not manage to find an adequate solution or because the spatiotemporal sequence does not correspond to the model [30]. In the latter case, as the lognormal is accepted as a neuromotor model, we could also say that the spatiotemporal sequence does not correspond to the timing conditions under which lognormals emerge, as predicted by the central limit theorem [33].
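A minimal sketch of the reconstruction-quality check is given below, assuming both velocities are available as equally sampled lists of (vx, vy) pairs, so that the sampling step cancels in the ratio. The function name and the sample values are hypothetical.

```python
import math

def reconstruction_snr_db(v_orig, v_recon):
    """SNR in dB between an original and a reconstructed 2D velocity sequence.

    Both inputs are lists of (vx, vy) tuples sampled at the same instants.
    """
    if len(v_orig) != len(v_recon):
        raise ValueError("sequences must have the same number of samples")
    signal = sum(ox * ox + oy * oy for ox, oy in v_orig)
    noise = sum((ox - rx) ** 2 + (oy - ry) ** 2
                for (ox, oy), (rx, ry) in zip(v_orig, v_recon))
    if noise == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal / noise)

# Made-up samples: the last reconstructed sample is off by one unit.
v_orig = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
v_recon = [(1.0, 0.0), (2.0, 0.0), (2.0, 0.0)]
snr = reconstruction_snr_db(v_orig, v_recon)
print(f"SNR = {snr:.2f} dB, acceptable: {snr >= 15.0}")
```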

Sigma-Lognormal Parameterization Method
The scheme for applying the Sigma-lognormal model in speech is presented in Fig. 2. The model divides speech generation into two steps: planning of the sequence of sounds (effector-independent) and execution of the sequence via the end effector (effector-dependent) [20,21,34]. Firstly, in the effector-independent step, a sound map (higher-level goal) is defined, assuming that each simple learned sound has a corresponding position on a hexagonal grid. Note that in this map (different for each person), the targets are sounds, and not phonemes, since a phoneme can be defined either as a simple sound or as a group of different sounds. Processing a sequence of sounds (for example, [uiau] in Fig. 2) involves moving through different positions on the grid and generating a trajectory through the selected sounds from a series of commands. Secondly, the effector-dependent module is linked to the neuromuscular system itself (end effector) and is defined by its impulse response to each command. The end effector movement causes the vocal tract shape to vary, thus changing the resonance frequencies, and therefore, the formants (resonant frequencies of the oronasopharyngeal tract) over time [35][36][37].
In this work, we use a reverse engineering approach, starting from the formants and moving up to the sound trajectory; from variations in the formants, we estimate both the parameters that model the commands that determine the transition from one grid position to another (simple movement) and the muscular response to each command. Thus, we will model neither the initial positions on the grid nor the physical constraints of the vocal tract.
To perform a Sigma-lognormal analysis of speech, a spatiotemporal sequence that globally represents the speech kinematics is required, as previously depicted in Fig. 2. To this end, we rely on the resonance tube model. Indeed, in speech synthesis, the vocal tract (from the glottis to the lips) can be represented as a concatenation of lossless acoustic tubes, where the shape and the volume of the vocal tract vary for each sound [36,38]. An increment or decrement of the section and length of the tubes produces a change in the resonance frequencies, and accordingly, a change of formants in the output speech, as we can see in Fig. 3. This means that each motor command that the brain produces to generate the synchronous muscle movement required to go from one acoustic position to another changes the resonant cavities. Thus, a relationship can be established between an increment of the formant and the increment of the resonant areas, or between the formant tracks and the movements of muscles [39]. Then, if the estimated velocity is integrated, a kinematic trajectory to be analyzed by our model can be obtained. Note that the resonant cavities of each subject are different, depending on the morphology and length of the organs that comprise them. Therefore, when a language is learned, the articulatory position of each sound is set as a function of the resonant cavities needed to produce the sound closest to the ideal one the person is trying to learn, and of how the sound is perceived [40][41][42].
This kinematic trajectory of the formants can be considered as the movement of a reference center (RC) of a speech end effector over the acoustic space, much the same as the movement of the pencil tip over paper represents the movement produced by the end effector during handwriting. According to this analogy, the model parameters could be recovered from a speech signal in three steps: (1) tracking of the formants; (2) derivation of the end effector kinematics from the formant sequence; and (3) parameterization of the resulting trajectory using the Sigma-lognormal model.

Formant Estimation
To estimate the speech kinematics from the acoustic space in a non-invasive fashion, the formants are evaluated from the speech recorded with a microphone. This procedure is similar to the one used in handwriting, where the movements of the pencil tip are captured with a digitizing tablet.
Formants can be tracked using many methods proposed in the literature. In this paper, we use some of the methods implemented in the PRAAT software [43] to ensure experimental repeatability and to test the dependence of the proposed methodology on the formant estimation procedure.
Unvoiced consonants have no clear formants. However, in fluent speech, they are usually co-articulated with a voiced sound [35], and so we assume that the missing formant information can be interpolated as a movement between the positions of the previous and posterior voiced phonemes.
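The interpolation across unvoiced segments can be sketched as follows. The text only states that the missing formant information is interpolated between the neighbouring voiced positions; linear interpolation, the use of `None` to mark frames without a formant, and holding the nearest value at the edges are all assumptions made for this example.

```python
def fill_unvoiced(track):
    """Return a copy of `track` with None gaps filled by linear interpolation.

    Gaps at the very start or end are held at the nearest voiced value.
    """
    voiced = [i for i, value in enumerate(track) if value is not None]
    if not voiced:
        raise ValueError("track contains no voiced frames")
    filled = list(track)
    first, last = voiced[0], voiced[-1]
    for i in range(first):
        filled[i] = track[first]
    for i in range(last + 1, len(track)):
        filled[i] = track[last]
    # Interpolate inside every gap bounded by two voiced frames.
    for left, right in zip(voiced, voiced[1:]):
        for i in range(left + 1, right):
            weight = (i - left) / (right - left)
            filled[i] = (1.0 - weight) * track[left] + weight * track[right]
    return filled

# Hypothetical F2 track in Hz with a three-frame unvoiced gap.
f2_track = [1800.0, 1750.0, None, None, None, 1350.0, 1300.0]
f2_filled = fill_unvoiced(f2_track)
print(f2_filled)
```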

Formants to End Effector Kinematics
Speech kinematics can be computed from the speech formants, since the formant track is related to the movement in the tube resonance model and to its velocity. Usually, the first two formants of the voice ($F_1$ and $F_2$) can give a spatial representation of the most frequent sounds and can be used to estimate the movement of the end effector needed to go from one sound to another. As can be seen in Fig. 4 (left), increments or decrements in the first or the second formant are related to changes in the pronounced sound. These changes can be represented as a trajectory drawn on an imaginary axis (Fig. 4 (right)). Since the proportion of the contribution of the first and second formants to the kinematic space is an ill-posed problem [39,44,45,46], a transfer coefficient $c_i$ is added to the mapping equation. Hence, the conversion from the acoustic space to the kinematics space can be approximated by a linear transform such as

$$ \begin{bmatrix} \partial x(t)/\partial t \\ \partial y(t)/\partial t \end{bmatrix} = \begin{bmatrix} c_1 & 0 \\ 0 & c_2 \end{bmatrix} \begin{bmatrix} \partial F_1(t)/\partial t \\ \partial F_2(t)/\partial t \end{bmatrix} \tag{8} $$

where $F_1(t)$ and $F_2(t)$ are the tracks of the first and second formants, $c_i$ are the transfer coefficients, $x(t)$ and $y(t)$ are the trajectories along the two imaginary axes in the kinematic space, $\partial x(t)/\partial t$ denotes the derivative of the generic sequence $x(t)$, and $\partial y(t)/\partial t$ denotes the derivative of the generic sequence $y(t)$.
Once $\partial x(t)/\partial t$ and $\partial y(t)/\partial t$ are calculated from the formants, the approximate velocity $v_f(t)$ is estimated as

$$ v_f(t) = \sqrt{ \left( \frac{\partial x(t)}{\partial t} \right)^2 + \left( \frac{\partial y(t)}{\partial t} \right)^2 } \tag{9} $$

The end effector reference center (RC) trajectory can thus be obtained by integrating Eq. 8, which leads to:

$$ x(t) = c_1 F_1(t), \qquad y(t) = c_2 F_2(t) \tag{10} $$

We assume that the initial conditions, which are irrelevant for the Sigma-lognormal analysis, are equal to zero. Then, $x(t)$ and $y(t)$ refer to the spatiotemporal sequence that represents the end effector movement. Thus, $c_1$ and $c_2$ can be seen as the weights that map the formants $F_1$ and $F_2$ into their spatial representation. To evaluate which proportion between $F_1$ and $F_2$ gives more information about the articulatory movement to which the kinematic theory of rapid human movements should be applied, in this work, we calculate these weights using

$$ c_1 = k(1 - \alpha), \qquad c_2 = k\alpha \tag{11} $$

where $k$ is the scale constant and α is the proportion parameter that defines the relative contribution of $F_1$ and $F_2$ to the kinematic space. It depends on the shape of the vocal tract.
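The mapping from formant tracks to a kinematic trajectory and its speed can be sketched as below. The sketch assumes that the weights take the form c1 = k(1 − α) and c2 = kα, that the tracks are uniformly sampled every `dt` seconds, and that a first-order finite difference is an acceptable derivative estimate; the function name and the sample formant values are hypothetical.

```python
import math

def formants_to_kinematics(f1, f2, dt, alpha=0.3, k=0.04):
    """Map F1/F2 tracks (Hz) to a trajectory (x, y) and a speed profile.

    Assumes c1 = k * (1 - alpha) weights F1 and c2 = k * alpha weights F2.
    The trajectory starts at the origin (zero initial conditions).
    """
    if len(f1) != len(f2) or len(f1) < 2:
        raise ValueError("need two tracks of equal length with at least 2 frames")
    c1 = k * (1.0 - alpha)
    c2 = k * alpha
    x = [c1 * (f - f1[0]) for f in f1]
    y = [c2 * (f - f2[0]) for f in f2]
    # Finite-difference speed between consecutive frames.
    speed = [math.hypot(x[i + 1] - x[i], y[i + 1] - y[i]) / dt
             for i in range(len(x) - 1)]
    return x, y, speed

# Made-up formant frames sampled every 10 ms.
f1 = [300.0, 400.0, 400.0]
f2 = [2000.0, 2000.0, 1900.0]
x, y, speed = formants_to_kinematics(f1, f2, dt=0.01)
print(speed)
```

With k expressed in mm/Hz, the resulting speeds are in mm/s, which makes them directly comparable to sensor-based measurements of articulator velocity.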
To calculate α and k, we assume that the acoustic space could be transformed into a hexagonal kinematic space (Fig. 4, right).Based on previous handwriting synthesis studies [20], and inspired by the hexagonal grid cell distribution proposed by Moser et al. [27,28], the vowel triangle limits are fitted with an equilateral triangle.In this work, we hypothesize that α can be approximated as the value that keeps L 1 = L 2 , with L 1 being the distance between the /i/ position and /a/, and L 2 , the distance between the /i/ and /u/ (Fig. 4, right).
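To this end, as the external vowels of the kinematic space are usually /a/, /i/, and /u/ (see Fig. 4), we define $F_{1a}$, the first formant of the vowel /a/; $F_{1i}$ and $F_{2i}$, the first and second formants of the vowel /i/, respectively; and $F_{2u}$, the second formant of the vowel /u/.
The height $h$ of the triangle can be calculated as

$$ h = c_1 (F_{1a} - F_{1i}) \tag{12} $$

Considering the triangle as an equilateral triangle, it means that

$$ h = \frac{\sqrt{3}}{2} L_2 \tag{13} $$

and

$$ L_2 = c_2 (F_{2i} - F_{2u}) \tag{14} $$

so that

$$ c_1 (F_{1a} - F_{1i}) = \frac{\sqrt{3}}{2}\, c_2 (F_{2i} - F_{2u}) \tag{15} $$

Replacing $c_1 = k(1 - \alpha)$ and $c_2 = k\alpha$ by their values:

$$ k(1 - \alpha)(F_{1a} - F_{1i}) = \frac{\sqrt{3}}{2}\, k\alpha (F_{2i} - F_{2u}) \tag{16} $$

Obtaining α (the proportion between $F_1$ and $F_2$):

$$ \alpha = \frac{F_{1a} - F_{1i}}{(F_{1a} - F_{1i}) + \frac{\sqrt{3}}{2}(F_{2i} - F_{2u})} \tag{17} $$

It should be noted that both the value of α and the formant values are speaker-dependent. Table 1 shows the α values obtained with Eq. 17 using the formant values given by Hillenbrand et al. in [47] (English vowels) and by Pätzold et al. [48] (German vowels). We can see that the values range from 0.25 to 0.33.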
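As a worked example of the equilateral-triangle condition, the sketch below computes α from the four corner formants, assuming the triangle height is weighted by k(1 − α) along F1 and its base by kα along F2. The formant values are round illustrative numbers, not the values from [47] or [48], and the function name is hypothetical.

```python
import math

def proportion_alpha(f1_a, f1_i, f2_i, f2_u):
    """Proportion parameter that makes the /a/-/i/-/u/ triangle equilateral.

    Assumes the height is (1 - alpha) * (f1_a - f1_i) and the base is
    alpha * (f2_i - f2_u), up to the common scale factor k.
    """
    height_hz = f1_a - f1_i
    base_hz = f2_i - f2_u
    return height_hz / (height_hz + (math.sqrt(3.0) / 2.0) * base_hz)

# Illustrative corner vowels (Hz): /a/ F1 = 750, /i/ F1 = 300, F2 = 2300, /u/ F2 = 1000.
alpha = proportion_alpha(750.0, 300.0, 2300.0, 1000.0)
print(f"alpha = {alpha:.3f}")

# Sanity check: with this alpha, height / base equals sqrt(3) / 2.
height = (1.0 - alpha) * (750.0 - 300.0)
base = alpha * (2300.0 - 1000.0)
print(f"height / base = {height / base:.4f}")
```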
The constant $k$ is a scale factor that converts the estimated values of $x(t)$ and $y(t)$ to centimeters. Unlike the proportion parameter α, this constant is not necessary for the Sigma-lognormal model. However, a reasonable value of $k$ facilitates understanding of the model.
To find this reasonable value, we can use the already known relationship between $L_2$ and $k$ given by:

$$ L_2 = k\alpha (F_{2i} - F_{2u}) \tag{18} $$

$k$ is thus obtained as

$$ k = \frac{L_2}{\alpha (F_{2i} - F_{2u})} \tag{19} $$

To calculate the $k$ value associated with a real movement, the $L_2$ value can be obtained from the results presented by Whitfield et al. [49], where the movement needed to utter the sentence "It's time to shop for two new suits" was measured in 20 subjects with sensors. We take the values obtained with the tongue front marker (TF) in mm, the mean of the range of $F_2$, and the parameter α rounded to 0.3. This leads to a $k$ factor of about 0.04 mm/Hz, which keeps the peak velocities similar to the ones presented in [50].
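The scale factor can be illustrated numerically, assuming the relationship L2 = kα(F2i − F2u). The movement range and the F2 range used here are placeholder values chosen only to show the arithmetic; they are not the measurements reported in [49].

```python
def scale_factor(l2_mm, alpha, f2_range_hz):
    """Scale constant k in mm/Hz from a measured movement range and an F2 range."""
    if alpha <= 0.0 or f2_range_hz <= 0.0:
        raise ValueError("alpha and the F2 range must be positive")
    return l2_mm / (alpha * f2_range_hz)

# Placeholder inputs: 15.6 mm of movement for a 1300 Hz F2 range, alpha = 0.3.
k = scale_factor(15.6, 0.3, 1300.0)
print(f"k = {k:.3f} mm/Hz")
```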

Sigma-Lognormal Analysis
Once the trajectory has been estimated, it is modeled with the kinematic theory of rapid human movements through the Sigma-lognormal model, as explained in "Overview of the Sigma-Lognormal Model". The kinematic theory is applied in an attempt to model the speech kinematics as a synchronized summation of simple overlapped movements, inspired by how the brain issues time-spaced commands to the articulatory organs. As such, speech is modeled as a global movement instead of a single muscle or group of muscles modeled independently.
The hypothesis underlying the application of this model to speech posits that a lognormal in speech has a similar meaning as in handwriting, a primitive that has been widely tried and tested. Therefore, in the case of speech, the number of lognormals would be related to the number of simple articulatory movements in natural, healthy speech. Hence, the number of lognormals should be related to the number of speech sounds uttered and their timing. It is expected that a neuromotor dysfunction will affect the number, shape, and time of occurrence of the lognormals, as is the case in handwriting [19]. These neuromotor dysfunctions can be due to normal aging or neurodegenerative diseases. Laryngeal pathologies are a special case: since they affect only the closing of the glottis and the voice source, they should not affect the timing parameters or the lognormal shape for subjects of the same age, but they could result in more simple movements due to the effort needed to talk and to the pauses in the pronunciation of a sentence.
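Beyond the sequence of lognormal parameters $P = \left\{ D_j, t_{0j}, \mu_j, \sigma_j, \Theta_{ej}, \Theta_{sj}, \mathrm{VTP}_{j-1} \right\}_{j=1}^{NbLog}$, it makes sense to define and use supplementary parameters related to the timing intervals between lognormals and lognormal shapes. Such parameters can help improve our understanding of some diseases. Examples of these parameters include:
• $\Delta t_0$: the mean of the time between successive lognormals, that is, the mean of the difference between the current lognormal and the previous one:

$$ \Delta t_0 = \frac{1}{NbLog - 1} \sum_{j=2}^{NbLog} \left( t_{0j} - t_{0(j-1)} \right) \tag{20} $$

• $V_p$: the average of the maximum velocity of the NbLog lognormals, where $v_j(t)$ is the velocity profile of the $j$th lognormal:

$$ V_p = \frac{1}{NbLog} \sum_{j=1}^{NbLog} \max_t v_j(t) \tag{21} $$

• $\bar{\mu}$: the mean of the log time delay:

$$ \bar{\mu} = \frac{1}{NbLog} \sum_{j=1}^{NbLog} \mu_j \tag{22} $$

• $\bar{\sigma}$: the mean of the lognormal response time:

$$ \bar{\sigma} = \frac{1}{NbLog} \sum_{j=1}^{NbLog} \sigma_j \tag{23} $$

• $\bar{D}$: the mean of the lognormal distance covered in the kinematic space:

$$ \bar{D} = \frac{1}{NbLog} \sum_{j=1}^{NbLog} D_j \tag{24} $$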
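The supplementary parameters listed above can be computed from an extracted parameter set along the following lines. The dictionary layout and the stroke values are illustrative assumptions. The peak of each lognormal is obtained in closed form, using the fact that a lognormal profile reaches its maximum at t0 + exp(μ − σ²).

```python
import math
from statistics import mean

def peak_speed(D, mu, sigma):
    """Maximum of a lognormal speed profile, reached at t0 + exp(mu - sigma**2)."""
    return D / (sigma * math.sqrt(2.0 * math.pi)) * math.exp(0.5 * sigma ** 2 - mu)

def summary_parameters(strokes):
    """Mean timing and shape descriptors over a list of lognormal strokes."""
    ordered = sorted(strokes, key=lambda s: s["t0"])
    if len(ordered) < 2:
        raise ValueError("at least two lognormals are needed for a time gap")
    gaps = [b["t0"] - a["t0"] for a, b in zip(ordered, ordered[1:])]
    return {
        "delta_t0": mean(gaps),
        "v_peak": mean([peak_speed(s["D"], s["mu"], s["sigma"]) for s in ordered]),
        "mu": mean([s["mu"] for s in ordered]),
        "sigma": mean([s["sigma"] for s in ordered]),
        "D": mean([s["D"] for s in ordered]),
    }

# Three made-up strokes.
example_strokes = [
    {"D": 10.0, "t0": 0.0, "mu": -1.5, "sigma": 0.3},
    {"D": 20.0, "t0": 0.1, "mu": -1.5, "sigma": 0.3},
    {"D": 30.0, "t0": 0.3, "mu": -1.5, "sigma": 0.3},
]
summary = summary_parameters(example_strokes)
print(summary)
```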

Evaluation, Results, and Discussion
The evaluation of the model is aimed at answering the following three questions:
1. What is the meaning of each lognormal in speech?
2. Which range of α (the proportion between $F_1$ and $F_2$) is adequate to apply the Sigma-lognormal model?
3. Do the speech lognormal parameters model aging phenomena in speech?

Databases
In handwriting, a lognormal expresses a primitive movement, related to a simple stroke. If a lognormal in speech retains a similar meaning, strokes should be associated with simple speech movements that are linked to the movements needed to pronounce a speech sound. To check this hypothesis, we used the VTR-TIMIT database [51,52]. The advantage of this database is that the formants it contains have been manually annotated and the phonemes labeled, providing the background that allows correlating lognormals to phonemes. The VTR-TIMIT [51] database is composed of 538 English utterances from the TIMIT corpus [52], with phonetically compact sentences (SX) and phonetically diverse sentences (SI). The VTR-TIMIT database is labeled by phonemes. In this experiment, we use the complete dataset of 197 speakers and 538 utterances in total. The database is balanced in terms of speakers, dialects, gender, and phonemes [51].
Furthermore, as the Sigma-lognormal model links lognormals to the impulse response of a neuromotor system, it is assumed that only neurodegenerative or neuromotor diseases will affect the lognormal parameters in fluent speech. To assess this premise, we used the Saarbruecken Voice Database [53]. This database contains healthy speakers as well as speakers with laryngeal pathologies. The database is labeled with the speaker's age and the kinds of pathologies they have.
The Saarbruecken Voice Database [53] is a collection of German speech recordings from more than 2000 speakers. The sentence recorded is "Guten Morgen, wie geht es Ihnen?" ("Good morning, how are you?"). For our experiments, we divided this database into three groups:
• Young speakers' group, which encompassed speakers aged between 20 and 30. It included both healthy speakers and speakers with laryngeal pathologies. There was a total of 609 speakers: 236 males and 373 females.
• Middle speakers' group, which encompassed speakers aged between 40 and 50, containing healthy speakers and speakers with laryngeal pathologies. There was a total of 352 speakers: 177 males and 175 females.
• Older speakers' group, which included all speakers aged 60 to 80 years old. There was a total of 466 speakers in this group: 262 males and 204 females.
All the recordings were made in a controlled environment at a sampling frequency of 50 kHz and a 16-bit resolution. The recordings contained 71 different laryngeal pathologies, including both organic and functional disorders.
The term laryngeal pathologies (LP) comprises a wide range of disorders. The most frequent ones are organic: they affect the morphology of the excitation organs and produce irregular vibration patterns [54]. Some examples of these disorders are polyps, nodules, edemas, and carcinomas. The phonation in these cases is characterized by noisy bands in the spectrogram, instability in the vibration frequency of the vocal cords, irregular airflow, and the presence of turbulent noise.

Experiment 1: Meaning of Lognormal in Speech and α Empirical Estimation
To assess the meaning of a lognormal in speech, the first experiment aimed to study the relationship between lognormals and phonemes. Additionally, as the velocity is a function of α (the proportion between formants), the optimum values of this constant are estimated in this experiment to be compared with the theoretical estimation in "Formants to End Effector Kinematics." For this assessment, we employed the publicly available VTR-TIMIT database of continuous speech, which is labeled by phonemes, thus providing the number of phonemes ($N_p$) in each sentence. All the sentences of this database were analyzed by ScriptStudio [31] and decomposed into a sequence of lognormals. The number and timing of lognormals were compared with the phonemic labels of the database. The velocity was obtained from the formant track provided by the dataset.
An example of such an analysis is shown in Fig. 5. It corresponds to an excerpt ("Their records") from the sentence "How permanent are their records?" in English. In this figure, we can observe the speech waveform, the spectrogram with its formant track, and the lognormal decomposition of the velocity. In Fig. 5, it can be seen that there are almost as many phonemes as there are lognormals. Besides, the lognormals are temporally ahead of the phonemes, as the movement between two phonemes precedes the sound. This is shown in Fig. 6, which zooms in on Fig. 5. Further, we can observe how the velocity peak usually appears alongside the phoneme transition, since a fast change in the resonance cavities is required to pronounce the next sound. We can also see that when the duration of the phoneme is long or its articulation requires more than one simple movement, more than one lognormal appears, as is the case between 1.1 and 1.2 s in Fig. 5.
To illustrate how the correspondence between the phonemes and lognormals is obtained from a sentence, a study was carried out, looking at one phoneme after the other. For each phoneme, four possibilities were considered:
1. True positive (TP): A lognormal of the sentence that overlaps the phoneme is assigned to it. In this case, $TP_i = 1$, with $i$ being the index of the phoneme.
2. False positive (FP): Other lognormals of the sentence that overlap the phoneme under study. In this case, $FP_i$ is set to the number of lognormals that overlap the phoneme minus 1.
3. False negative (FN): If no lognormal overlaps the phoneme, $FN_i$ is set to 1.
4. True negative (TN): The lognormals of the sentence that do not overlap the phoneme. In this case, $TN_i$ is set to the number of lognormals that do not overlap the phoneme.
Note that $TP_i + FP_i + TN_i$ is equal to the number of lognormals of the sentence. The bounds of each lognormal are taken at 5% of its peak value. The match between the phonemes of a sentence and the lognormals obtained from it is measured by the true positive rate (TPR) and true negative rate (TNR) of the sentence, which are calculated as

$$ TPR = \frac{\sum_i TP_i}{\sum_i TP_i + \sum_i FN_i} \tag{25} $$

$$ TNR = \frac{\sum_i TN_i}{\sum_i TN_i + \sum_i FP_i} \tag{26} $$

Figure 7 shows TPR and TNR curves per gender and the mean value of both as a function of α. Although the α value used to work out the velocity from the formant track (Eqs. 9-11) was obtained theoretically in Eq. 17, it can be validated empirically by computing the TPR and TNR for different α values. Moreover, to see the correlation between the velocity peak occurrence ($t_v$) and the phoneme transition ($t_p$), the relationship between them, as seen in Fig. 8, is obtained through the error rate $\varepsilon_t$.
For the experiments, although the value of $k$ (see Eq. 19) does not affect the velocity profile shape or the result, it is approximated to 0.04 to keep a velocity peak close to 200 mm/s, as measured by the sensors in [50].
Fig. 6 Zoom-in of Fig. 5 to show the exact correspondence between phoneme and lognormal
Fig. 7 TPR and TNR curves across the VTR-TIMIT database as a function of the F1-F2 proportion value α. The value of the lognormal is greater than 5% of its peak value
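The per-phoneme counting rules and the resulting rates can be sketched as follows, assuming each phoneme and each lognormal is represented by a (start, end) interval in seconds, with the lognormal bounds already taken at 5% of the peak. The rates use the standard definitions TPR = TP / (TP + FN) and TNR = TN / (TN + FP); the function name and the intervals are hypothetical.

```python
def phoneme_lognormal_rates(phonemes, lognormals):
    """Return (TPR, TNR) for one sentence from interval overlaps.

    phonemes, lognormals: lists of (start, end) tuples in seconds.
    """
    if not phonemes:
        raise ValueError("at least one phoneme is required")
    tp = fp = fn = tn = 0
    for p_start, p_end in phonemes:
        # Count lognormals whose support overlaps this phoneme.
        hits = sum(1 for l_start, l_end in lognormals
                   if l_start < p_end and p_start < l_end)
        if hits:
            tp += 1          # one overlapping lognormal is assigned to the phoneme
            fp += hits - 1   # the remaining overlapping ones are false positives
        else:
            fn += 1          # no lognormal overlaps the phoneme
        tn += len(lognormals) - hits
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr

# Made-up sentence: three phonemes, three lognormals.
phonemes = [(0.0, 0.1), (0.1, 0.2), (0.2, 0.3)]
lognormals = [(0.02, 0.08), (0.05, 0.09), (0.22, 0.28)]
tpr, tnr = phoneme_lognormal_rates(phonemes, lognormals)
print(f"TPR = {tpr:.3f}, TNR = {tnr:.3f}")
```

In this toy sentence, the first phoneme is overlapped by two lognormals, the second by none, and the third by one, so two of the three phonemes are matched.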
We can see in Fig. 7 that the TPR curves reach their maximum values for α around 0.35. Further, as seen in Fig. 8, $\varepsilon_t$ reaches its minimum values for α between 0.2 and 0.4, which means that the lognormal peak is closest to the phoneme transition in healthy adults. These results show that, for both males and females, this procedure is effective and not overly sensitive to the value of α in the 0.2 ≤ α ≤ 0.4 range, which is similar to the value proposed in "Formants to End Effector Kinematics". Note that to pronounce some phonemes, more than one simple movement is required, and each subject could need a different α value.

Experiment 2: Speech Lognormals, Aging, and Laryngeal Pathologies
As speakers get older, their neuromotor systems deteriorate, and movements require additional effort and become slower. In handwriting, this implies additional short strokes and slow handwriting. The same should apply to speech: a greater number of short movements or small lognormals and more time between these lognormals than in young speakers.
In this context, to gain insights into the meaning of a lognormal representation in speech, the second experiment compared the lognormals detected in young and older speakers, including subjects with laryngeal pathologies.
The experiment was run with the Saarbruecken Voice Database, which labels recorded sentences with the age of the speakers and allows comparisons between the results obtained with the groups of young and older speakers. In the cases where the results show a significant difference (NbLog, $\Delta t_0$, the delay in the muscular response, and SNR), the experiments were repeated in order to evaluate the evolution of the parameters across three age groups (young, middle, and older) (Table 5). Gender is omitted in the analyses that follow, since the experiments in "Experiment 1: Meaning of Lognormal in Speech and Empirical Estimation" show similar results for males and females. Moreover, gender is reasonably balanced in the database, and the effect of age and gender cannot be confounded [19].
As the Saarbruecken Voice Database does not provide formant tracks, these were obtained with the following two formant estimation methods (available in the Praat software package [43]):
- "From speech to formant (sl)": This algorithm is based on the implementation of the Split Levinson algorithm proposed by Willems [55]. It always finds the requested number of formants in every frame, even if they do not exist.
- "From speech to formant (keep all)": In this case, Praat applies a Gaussian-like window and computes the formants from the LPC spectrum obtained through the Burg algorithm [56,57].
The following settings were used in the Praat software for both methods to determine the first two formants in all the sentences of the Saarbruecken Voice Database: a time step of 0.01 s, a maximum number of formants of 5, and a window length of 0.025 s.
To calculate the speech kinematics, based on the previous result, the parameter α was set to 0.3 and k to 0.04. The speech trajectory was processed with ScriptStudio® [30] to decompose the speech kinematics into lognormals.
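For orientation, the sketch below evaluates the forward model that such a decomposition fits: a single lognormal velocity component in the standard form of the kinematic theory, its peak time, and its bounds at 5% of the peak value (the criterion used in Experiment 1). It is not the extraction algorithm of ScriptStudio, which is not reproduced here. The parameter names (`D`, `t0`, `mu`, `sigma`) follow the usual Sigma-lognormal notation, and the numeric values are hypothetical.

```python
import math


def lognormal_speed(t, D, t0, mu, sigma):
    """Speed of one lognormal component at time t (zero at or before onset t0)."""
    if t <= t0:
        return 0.0
    x = t - t0
    return D / (sigma * math.sqrt(2 * math.pi) * x) * math.exp(
        -((math.log(x) - mu) ** 2) / (2 * sigma ** 2)
    )


def peak_time(t0, mu, sigma):
    """Time of the velocity peak: the mode of the lognormal, shifted by t0."""
    return t0 + math.exp(mu - sigma ** 2)


def bounds_at_fraction(D, t0, mu, sigma, fraction=0.05, dt=1e-4, t_max=2.0):
    """Scan a time grid and return the first and last grid times at which the
    speed is at least `fraction` of the peak speed."""
    peak = lognormal_speed(peak_time(t0, mu, sigma), D, t0, mu, sigma)
    n_steps = int(round((t_max - t0) / dt))
    inside = [
        t0 + i * dt
        for i in range(1, n_steps + 1)
        if lognormal_speed(t0 + i * dt, D, t0, mu, sigma) >= fraction * peak
    ]
    return inside[0], inside[-1]


# Hypothetical parameter values, chosen only for illustration.
D, t0, mu, sigma = 10.0, 0.10, -2.0, 0.3
t_peak = peak_time(t0, mu, sigma)
start, end = bounds_at_fraction(D, t0, mu, sigma)
print(f"peak at {t_peak:.4f} s, 5% bounds from {start:.4f} s to {end:.4f} s")
```

The grid scan is a deliberately simple choice; a closed-form or root-finding approach would also work and would avoid the dependence on `dt` and `t_max`.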
The results are shown graphically in Figs. 9 and 10, and numerically in Tables 2-5. These tables also include the averaged values and the standard deviation of all the lognormal parameters, along with a one-way ANOVA (analysis of variance) [58]. Multiple comparison tests with Bonferroni correction are used when three classes (young, middle, and older) are analyzed. In this type of analysis, two groups are considered statistically different if the residual p value is below 0.05 and statistically similar if the p value is above 0.05 [58].
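To make the statistical procedure concrete, the following sketch computes the one-way ANOVA F statistic for several groups and a Bonferroni-adjusted significance threshold for pairwise comparisons. It uses only the standard library, so it stops at the F statistic: turning F into a p value requires the F distribution (for instance via `scipy.stats.f_oneway`). Applying the correction by dividing the threshold by the number of comparisons is an assumption about how the correction was applied, and the group values are invented, not taken from the tables.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of sample groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of the samples around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / n_comparisons


# Invented NbLog-like values for three age groups, for illustration only.
young, middle, older = [1, 2, 3], [2, 3, 4], [5, 6, 7]
f_stat, df_b, df_w = one_way_anova_f([young, middle, older])
threshold = bonferroni_alpha(0.05, 3)  # three pairwise comparisons
print(f_stat, df_b, df_w, threshold)
```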
The findings can be summarized as follows:
1. $\Delta t_0$ is sensitive to the speaker's age. While there is a significant difference between the young and the older speakers (p value < 0.001), there is no statistically significant variation between healthy speakers and speakers with laryngeal pathologies for this parameter (Tables 3-5). This means that the time between commands increases with age [19], as reflected by the increase in $\Delta t_0$, but not with laryngeal pathologies. This is consistent with the well-known fact regarding slower reaction times in these conditions.
2. The number of lognormals (NbLog) is greater for the group of older speakers than for the group of young speakers (Tables 2 and 5; Fig. 9). The p value is lower than 0.001 with both formant estimation algorithms. This is consistent with the results observed in handwriting, where the kinematic theory was used to evaluate aging. The results might suggest that the deterioration of motor control with aging is associated with the development of compensatory strategies, such as emitting more motor commands to generate an adequate movement for a given task [19]. A significant difference is also observed between young healthy and LP speakers in the number of lognormals (Fig. 10, Table 3). This type of disease should, therefore, influence the number of lognormals due to increases in the number of simple movements following necessary pauses and silences in a sentence.
3. The SNR parameter decreases and the number of lognormals (NbLog) increases in both older people and the young LP group (Table 3), with the difference being more significant with age than with laryngeal disease.
4. The parameter related to the delay in the muscular response increases with age, indicating that the impulse response of the system is slower in the case of older speakers. This difference is only appreciated with the "from speech to formant (sl)" method (Table 2), which could be because this formant extraction method always gives the requested number of formants in every frame, allowing the best interpolation of the complete movement in the case of consonants.
5. Regarding the other parameters in Tables 2-4, including $V_p$ and $D$, no significant difference can be seen between the two age groups.
6. When the three age groups are compared (Table 5), only the NbLog parameter shows significant differences among the three classes (Fig. 9).

Discussion
The results show how the Sigma-lognormal model can be applied to model neuromuscular aging in speech. When the speech is modeled, each of the lognormals obtained reflects a group of commands and their end muscular response shapes. Neurological diseases, learning processes, or aging can affect this command sequence, changing the proportion of final movements, the speech rate, or the muscular response shape, which is consistent with the lognormality principle [19].
In the above results, the parameters related to the time between commands ($\Delta t_0$) and the delay in the muscular response are longer in older people, as the movements become slower with age. Moreover, the experiments show a clear relationship between the number of simple movements found by the model and the number of pronounced phonemes. This relationship depends on the proportion between the first and second formants used to estimate the trajectory, as tested in the experiments. Also, the method used to detect the formants can affect the parameters obtained, so the Sigma-lognormal method provides information on how well the formant extractor is able to follow muscular movements. The model was tested with two different languages (English and German) and seems to be language-independent, as has also been observed when the lognormal model is applied in handwriting [59].

Conclusions
A Sigma-lognormal representation for modeling speech kinematics has been presented. The speech kinematics is estimated from the formant tracks and decomposed into simple lognormal movements by applying the kinematic theory of rapid human movements. Moreover, besides the Sigma-lognormal parameters, a set of derived parameters is proposed to describe the timing and the neuromotor impulse response.
The experiments conducted illustrate the meaning of a lognormal in speech and indicate an adequate proportion between the first and second formants for obtaining the kinematic information. The first experiment shows the link between a lognormal and a transition between phonemes, where the number of lognormals is similar to that of phonemes. In this experiment, the optimum proportion between the first and second formants was also verified. The second experiment links the lognormal to the generation of each end effector movement, showing that the parameter $\Delta t_0$, as in handwriting, increases significantly from young to older speakers, and that it is independent of dysfunction, such as problems in the larynx or glottis closure. This allows modeling aging in speech production as a delay between commands and the end effector responses.
The results show that it is possible to model speech with the kinematic theory, which provides biological information about the simple movements involved in speech.
As future lines of research, the model could be applied to speech synthesis, speech recognition, and speech rehabilitation, as well as to the design of systems to help in the screening and monitoring of some neurodegenerative diseases. The model could also permit the use of features similar to those obtained from studying other human movements, such as handwriting. Moreover, the use of more formants to estimate speech kinematics remains an open issue.

Fig. 1 Sigma-lognormal reverse engineering of a signal. Decomposition of the velocity profile into a sum of lognormals and the analyzed trajectory for each lognormal

Fig. 2 Scheme of the proposed model

Fig. 5 Relationship between phonemes and lognormals. Top: speech signal ("their records") segmented by phonemes. Middle: spectrogram. Bottom: lognormal decomposition (color changes for even and odd lognormals), velocity profile (dotted line), and TIMIT phoneme segmentation (blue bars)

The true positive rate and true negative rate of a sentence are

$$TPR_s = \frac{\sum_{i=1}^{N_p} TP_i}{N_p} \quad \text{and} \quad TNR_s = \frac{\sum_{i=1}^{N_p} TN_i}{NbLog - 1},$$

respectively. The TPR and TNR of the VTR-TIMIT dataset are obtained by averaging the TPRs and TNRs of all the sentences in it.

Fig. 8 Error rate across the VTR-TIMIT database as a function of the $F_1$-$F_2$ proportion value α

Table 1
Value estimated from different previous works

Table 4
Averaged value of the lognormal parameters for older healthy speakers and speakers with laryngeal pathologies

Table 5
Averaged and STD values of the lognormal parameters for young, middle, and older speakers with laryngeal pathologies ("To formant (sl)")