Emerging Verbal Functions in Early Infancy: Lessons from Observational and Computational Approaches on Typical Development and Neurodevelopmental Disorders

Objectives: Research on typically developing (TD) children and those with neurodevelopmental disorders and genetic syndromes was targeted. Specifically, studies on autism spectrum disorder, Down syndrome, Rett syndrome, fragile X syndrome, cerebral palsy, Angelman syndrome, tuberous sclerosis complex, Williams-Beuren syndrome, Cri-du-chat syndrome, Prader-Willi syndrome, and West syndrome were searched. The objectives are to review observational and computational studies on the emergence of (pre-)babbling vocalisations and outline findings on acoustic characteristics of early verbal functions.
Methods: A comprehensive review of the literature was performed, including observational and computational studies focusing on spontaneous infant vocalisations at the pre-babbling age in TD children and in individuals with genetic or neurodevelopmental disorders.
Results: While there is substantial knowledge about early vocal development in TD infants, the pre-babbling phase in infants with neurodevelopmental and genetic syndromes is scarcely scrutinised. Related approaches, paradigms, and definitions vary substantially, and insights into the onset and characteristics of early verbal functions in most of the above-mentioned disorders are missing. Most studies focused on acoustic low-level descriptors (e.g. fundamental frequency), which bore limited clinical relevance. This calls for computational approaches to analyse features of typical and atypical infant verbal development.
Conclusions: Pre-babbling vocalisations, as precursors of future speech-language functions, may reveal valuable signs for identifying infants at risk for atypical development. Observational studies should be complemented by computational approaches to enable an in-depth understanding of the developing speech-language functions. By disentangling features of typical and atypical early verbal development, computational approaches may support clinical screening and evaluation.

from recognising inter-individual discrepancies among peers caused by slowed or divergent functional acquisition within and across developmental domains, which might indicate stagnation or regression of intra-individual development. Notably, although the very early periods of speech-language development are not yet fully understood, atypicalities in the verbal domain are often among the first perceived signs of neurodiversity during the first year of life.
exhalation/expiration phase of one breathing cycle; Lynch et al., 1995a; Nathani & Oller, 2001) and segmented accordingly. Other approaches to segmenting infant speech have differentiated vocalisations through a pause criterion (e.g. pauses longer than 300 ms subdivide vocalisation clusters; Oller et al., 2010). Either way, the segmentation step is usually followed by an annotation process, in which trained listeners assign vocalisations to the predefined vocalisation classes (e.g. Koopmans-van Beinum & van der Stelt, 1986; Lang et al., 2021; Lynch et al., 1995a; Nathani et al., 2006; Oller, 1978, 2000; Roug et al., 1989; Stark, 1980). Recently, a citizen science study externally validated the expert classification of babbling vocalisations and the onset of canonical babbling (Cychosz et al., 2021). Together with findings on auditory Gestalt perception of experts and naïve listeners differentiating early verbal functions of infants with neurodevelopmental disorders (NDDs), this points to the existence of an intrinsic human Gestalt of different vocal categories or of typical vs. atypical pre-linguistic vocalisations (Marschik et al., 2012a). Human auditory Gestalt perception, or the adult capacity to intuitively recognise different vocal categories, becomes more robust when evaluating "higher order verbal functions" of infants. Explicitly, babbling vocalisations, being more salient in form, are easier for listeners to categorise than pre-babbling vocalisations uttered in the first 5 months of life (e.g. Marschik et al., 2012a).
In the first 5 months of life, before the canonical babbling stage, the various stage models concordantly describe a developmental pathway from simple phonation to an expansion phase (Fig. 1; e.g. Kent, 2022; Nathani et al., 2006; Oller, 2000). Oller and colleagues introduced a classification scheme of three types of infant vocalisations: cry, laughter and protophones; the latter are defined as precursors to speech and subdivided into vocants, squeals and growls (Jhang & Oller, 2017; Oller et al., 2013). Interestingly, spontaneously produced protophones were shown to outnumber cries and laughter from early on (Jhang & Oller, 2017; Oller et al., 2019). The importance of protophones lies, in contrast to cry and laughter, in their functional flexibility. They can be used in variable contexts and may fulfil different communicative functions (Jhang & Oller, 2017; Oller et al., 2013). Besides functional flexibility, the ontogeny of vocalisations has been discussed in terms of physiological constraints. Physiological adaptation of peripheral anatomical structures, such as the descent of the larynx or changes in vocal-tract shape (e.g. Fitch, 2010; Lieberman et al., 2001), as well as neurophysiological changes governing the functional output, shape the development and the increasing complexity of vocalisations (see Fig. 1; e.g. Kent, 2021, 2022; Oller, 2000; Zhang & Ghazanfar, 2020).
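The pause criterion described above for segmenting infant vocalisations can be illustrated with a minimal sketch. It assumes that voiced intervals have already been detected and are available as (onset, offset) pairs in seconds. The function name, the input format, and the example values are hypothetical; only the 300 ms threshold is taken from the text.

```python
def group_by_pause(intervals, max_pause=0.3):
    """Group (onset, offset) intervals into clusters.

    A new cluster starts whenever the pause since the previous
    interval is longer than max_pause seconds (300 ms by default).
    """
    clusters = []
    for onset, offset in sorted(intervals):
        if clusters and onset - clusters[-1][-1][1] <= max_pause:
            clusters[-1].append((onset, offset))
        else:
            clusters.append([(onset, offset)])
    return clusters


# Hypothetical voiced intervals in seconds (illustrative values only).
voiced = [(0.00, 0.40), (0.55, 0.90), (1.50, 1.80), (1.95, 2.30), (3.00, 3.20)]
clusters = group_by_pause(voiced)
cluster_sizes = [len(cluster) for cluster in clusters]
```

A breath-group criterion would require a different input (e.g. respiratory phase information) and is not covered by this sketch.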
As findings regarding the achievement of developmental milestones in infants with DDs were inconclusive, recent research has increasingly aimed at gaining in-depth knowledge about early vocal patterns through the extraction and characterisation of acoustic features of emerging verbal functions. For example, in cry but also in spontaneous infant vocal patterns, acoustic features such as fundamental frequency (the lowest frequency of a periodic waveform, usually denoted as F0) or duration of vocalisations have been documented (Borysiak et al., 2017; Buder et al., 2013; Hamrick et al., 2019; Kent & Murray, 1982; Wermke & Robb, 2010). More complex models for analysing acoustic properties of infant vocalisations include machine learning approaches applied to a set of signal-level parameters or features (Pokorny et al., 2020). There are established parameter sets for analysing voice features, such as the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS; Eyben et al., 2015) and the Computational Paralinguistics ChallengE parameter set (ComParE).
The features included in such sets can be subdivided into three categories: parameters related to frequency aspects (e.g. pitch), parameters related to the energy or amplitude of the signal (e.g. harmonics-to-noise ratio; HNR) and spectral parameters (e.g. harmonic differences). Another common way to produce a more specialised parameter set is the unsupervised Bag-of-Audio-Words (BoAW) approach, which derives a set of features from a customised codebook quantisation of the low-level descriptors (LLDs). In addition, machine learning models, including different varieties of neural networks, have been applied to vocalisations and tested on classification tasks (e.g. adult vs. infant speech, canonical vs. non-canonical utterances; Ebrahimpour et al., 2020; Warlaumont et al., 2010). In our group, we have utilised a machine learning approach (i.e. support vector machines) that focused on automatic preverbal vocalisation-based differentiation between typically developing infants and infants later diagnosed with RTT, FXS or ASD (Pokorny et al., 2016a, 2022). Studies evaluating acoustic features of early vocalisations or applying machine learning models or neural networks will be referred to as "computational studies" hereafter. Despite recent efforts to perceptually classify preverbal vocal patterns and characterise them acoustically, there is still a lack of synergised information in the field of prodromal or pre-diagnostic development in infants with neurodevelopmental or genetic disorders, especially concerning the pre-babbling phase. Therefore, the current article aimed to (i) outline characteristics of age-specific pre-linguistic vocalisations in the first 5 months of life (i.e. the pre-babbling phase), (ii) summarise computer-based approaches for the automated analysis of physiological and pathological pre-babbling vocalisations, and (iii) compare computer-based approaches on atypical early verbal functions and outline their potential to serve as a neurofunctional marker of DDs.
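The codebook quantisation underlying the BoAW approach mentioned above can be sketched as follows. This is a simplified illustration and not the procedure of any cited study: the codebook is built by random sampling of training frames (clustering is a common alternative), each frame is assigned to its nearest codeword by Euclidean distance, and the LLD frames, their two dimensions, and all function names are hypothetical. In practice, LLDs would typically be standardised before quantisation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two LLD frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_codebook(frames, size, seed=0):
    """Pick `size` training frames at random to serve as audio words."""
    return random.Random(seed).sample(frames, size)


def bag_of_audio_words(frames, codebook):
    """Count how often each audio word is the nearest codeword to a frame."""
    histogram = [0] * len(codebook)
    for frame in frames:
        nearest = min(
            range(len(codebook)),
            key=lambda i: squared_distance(frame, codebook[i]),
        )
        histogram[nearest] += 1
    return histogram


# Hypothetical LLD frames: (F0 in Hz, loudness in arbitrary units).
training_frames = [
    (350.0, 0.2), (360.0, 0.3), (420.0, 0.6),
    (500.0, 0.8), (510.0, 0.7), (340.0, 0.1),
]
codebook = build_codebook(training_frames, size=3)

# One vocalisation is represented by a fixed-length histogram of audio words.
vocalisation = [(355.0, 0.25), (505.0, 0.75), (498.0, 0.8), (345.0, 0.15)]
histogram = bag_of_audio_words(vocalisation, codebook)
```

The resulting histogram has one entry per codeword regardless of the duration of the vocalisation, which is what makes it usable as input to a classifier such as a support vector machine.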

Methods
To address the above-mentioned issues, we systematically searched the existing literature for (a) characteristics of and (b) state-of-the-art computational and observational methods on prelinguistic vocalisations in infants with DDs. We conducted two rounds of paper extraction and selection, the first one in September 2021 and a second one in February and March 2022 in the following online electronic databases: PubMed, Web of Science, Science Direct, Scopus, and PsycINFO using the search strings "infan* AND (prelinguistic OR preverbal OR cooing OR babbling OR vocal) AND (syndrome OR "genetic disorder" OR "developmental disorder")" and "infan* AND vocal* AND ("computational analysis" OR "acoustic analysis" OR "audio analysis")".
Following this initial step, we performed an ancestral search for papers from the retrieved articles and searched Google Scholar for further publications. The retrieved articles were screened by two independent raters (CW and SL). Results were discussed with the co-authors, duplicates were removed, and articles were selected according to the following criteria: (1) peer-reviewed; (2) original studies or reviews and meta-analyses; (3) written in English; and (4) focusing on the pre-babbling age (0 to 5 months) in (4a) typically developing infants and (4b) infants at elevated likelihood for or diagnosed with neurodevelopmental disorders (NDDs), late detected developmental disorders (LDDDs), genetic syndromes, or developmental disorders (DDs). Articles of interest were those based on human coder-based assessments (observational studies) as well as articles on machine learning approaches (computational studies). We intended to focus on spontaneous infant vocalisations and excluded all studies analysing or reporting infant cry or distress vocalisations as well as vocalisations from parent-child interaction paradigms (PCI).

Results
Our literature selection process led to a total of 27 papers, 17 of which are on pre-babbling in infants diagnosed with neurodevelopmental disorders or genetic syndromes applying observational methods (Table 1). Six articles focused on DS, seven on ASD (one of them also including infants with TSC), three on RTT or the preserved speech variant of RTT (PSV), and one on PWS. Two of the 17 articles reported acoustic features in addition to observational characteristics. The remaining ten articles focused on acoustic features or computational models: three studies applied computational methods to pre-babbling behaviour in TD infants (Table 2), and seven papers discussed the babbling stage in infants later diagnosed with a DD (i.e. ASD, CDS, PSV-RTT, RTT, WS and one study reporting on ASD, FXS, and RTT; Table 2). It is important to note that spontaneous vocalisations could not be reliably differentiated from vocalisations in interactive settings for all articles. Thus, contrary to the initially set exclusion criterion, we decided to report all observational studies of this age range and outlined information on data sampling whenever possible (Tables 1 and 2).
Whilst there are a number of studies reporting early physiological development according to the established stage models (Fig. 1), reports of atypical development in infants with neurodevelopmental disorders or genetic syndromes at the younger ages are rare (Table 1). Most of the 17 included studies report on expanded age bands up to 24 months; very few explicitly investigate the characteristics of early verbal functions emerging in the first 5 months of life (Brisson et al., 2014; Maestro et al., 2002; Pansy et al., 2019; Zappella et al., 2015). Most studies investigate developing verbal functions applying the classical approach of perceptual segmentation-annotation-classification. Less effort has been made to delineate acoustic features (such as duration of vocalisations, syllables or phrases, pitch, fundamental frequency (F0) or intonation contours; Brisson et al., 2014; Lynch et al., 1995a). Observational studies reveal inconclusive results on behavioural differences in pre-babbling vocalisations between infants with DDs and infants with typical development. Compared to TD infants, several diverse behaviours have been reported for DD, e.g. longer duration of rhythmic units in infants with DS (Lynch et al., 1995a); divergent intonation contours and less vocal response in interactive settings (Brisson et al., 2014); failure of some participants with ASD to achieve the developmental milestone "cooing" (Maestro et al., 2002; Zappella et al., 2015); and typical vocalisations interspersed with atypical forceful and/or inspiratory vocalisations in infants with RTT (Marschik et al., 2009). More details on age-specific vocalisations and characteristics of this period are outlined in Table 1.
More advanced methods such as digital measurement instruments and computational analyses open new possibilities for earlier identification of atypical development, as they surpass human capabilities of perception. Most approaches identified aim to describe and investigate trends in the typical development of vocalisations throughout the first 5 months of life. Very early studies focus on a categorical analysis of vocalisations, applying spectral analysis to gain insights beyond the verbal Gestalt perception (Buder et al., 2008; Lynch et al., 1995a; Oller et al., 2019; Warlaumont et al., 2010). The spectra analysed were obtained by applying a window function to successive segments of the signal. Most commonly, a fast Fourier transform (Heideman et al., 1985) is then computed for each windowed segment, and the results are presented as a graphical visualisation (spectrogram) showing the intensity of each frequency at each point in time. With the resulting graphical representation, one can visually determine fundamental and formant frequencies (F0 and Fn, respectively) and the general "shape" of a vocalisation (Bauer & Kent, 1987; Kent & Murray, 1982; Oller et al., 2019). Spectrography has been applied in studies over the last 3 decades, revealing specific intonation patterns in pre-babbling vocalisations and a developmental trajectory of F0 and Fn (Kent & Murray, 1982). Oller and colleagues used spectrograms to visualise examples of vocants, squeals, growls, and cries at specific ages, providing a visual description of the noise found in the signal as well as other unique features (e.g. F0 contour) of the analysed classes of utterances (Oller et al., 2019). Another feature that can be identified through inspection of the spectrogram or the waveform of a vocalisation is the duration of a single utterance. Duration is used in several studies to gain an understanding of how utterance durations change with age (Apicella et al., 2013; Brisson et al., 2014; Lynch et al., 1995a; Smith & Oller, 1981).
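The windowing and Fourier analysis steps described above can be sketched for a single analysis frame. The example is illustrative only: it uses a synthetic sine tone instead of a recorded vocalisation, a Hann window, and a direct discrete Fourier transform, which yields the same values as a fast Fourier transform but is slower. The sample rate, frame length, and tone frequency are assumed values, and the function names are hypothetical.

```python
import cmath
import math


def hann(n):
    """Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrum(frame):
    """DFT magnitudes (bins 0 to n/2) of a Hann-windowed frame."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hann(n))]
    return [
        abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)))
        for k in range(n // 2 + 1)
    ]


sample_rate = 8000   # Hz (assumed)
frame_length = 256   # samples per analysis frame (assumed)
tone_hz = 500.0      # synthetic stand-in for a periodic vocalisation

frame = [math.sin(2 * math.pi * tone_hz * t / sample_rate)
         for t in range(frame_length)]
spectrum = magnitude_spectrum(frame)

# Frequency of the strongest non-DC bin, in Hz.
peak_bin = max(range(1, len(spectrum)), key=lambda k: spectrum[k])
peak_hz = peak_bin * sample_rate / frame_length
```

A spectrogram stacks such spectra for successive, usually overlapping, frames. For a pure tone the strongest bin lies close to the tone frequency, but in real infant vocalisations the strongest spectral peak may be a harmonic rather than F0, so dedicated pitch-tracking methods are generally preferable.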
More in-depth analyses of audio signals require multi-dimensional parameter sets that provide feature-based representations of the underlying audio segment to a classifier, which can then build an optimal predictor for the classification scheme provided. There are pre-defined parameter sets that are commonly utilised in linguistic and acoustic analyses, for example the Computational Paralinguistics ChallengE parameter set (ComParE) or the eGeMAPS (Eyben et al., 2015). These parameter sets consist of low-level descriptors (LLDs). LLDs are parameters that are very closely related to the signal itself (e.g. fundamental frequency F0, loudness). To gain further insights into the general occurrence and statistical behaviour of those LLDs, functionals (e.g. mean, kurtosis, variance) are applied on top of them.
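The relation between LLDs and functionals can be made concrete with a small sketch: an LLD is a frame-wise contour of variable length, and functionals summarise it into a fixed number of statistics per vocalisation. The contour values and the function name below are hypothetical, and the kurtosis shown is the non-excess (Pearson) definition; toolkits implementing ComParE or eGeMAPS may use different definitions and many more functionals.

```python
import statistics


def functionals(contour):
    """Summarise a frame-wise LLD contour by mean, variance and kurtosis."""
    mean = statistics.fmean(contour)
    variance = statistics.pvariance(contour, mu=mean)
    fourth_moment = statistics.fmean((x - mean) ** 4 for x in contour)
    # Kurtosis is undefined for a constant contour (zero variance).
    kurtosis = fourth_moment / variance ** 2 if variance > 0 else float("nan")
    return {"mean": mean, "variance": variance, "kurtosis": kurtosis}


# Hypothetical F0 contour of one vocalisation, one value per frame (Hz).
f0_contour = [380.0, 400.0, 420.0, 400.0]
f0_functionals = functionals(f0_contour)
```

Applying the same functionals to every LLD of a vocalisation yields a feature vector of fixed length, independent of how long the vocalisation lasts.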
Yet, in the field of pre-babbling vocalisations, most studies rely on basic features such as duration or fundamental frequency to gain a more in-depth understanding of infant vocalisations (Apicella et al., 2013; Brisson et al., 2014; Lynch et al., 1995a; Smith & Oller, 1981). Visual spectrogram analysis has been used to evaluate different vocalisation shapes and to help estimate signal-to-noise ratios in certain vocalisation types (Oller et al., 2019). These approaches, whilst not utilising advanced computational methods, highlight the importance of particular features for the identification of certain vocalisation types and the analysis of developmental trajectories. Lynch and colleagues, who focused on a comparison between TD children and children with DS, present the only study that employs a feature-based approach in the analysis of pre-babbling vocalisations in infants with DDs (Lynch et al., 1995a). In this study, the duration of utterances was compared between DS and TD children across respective timelines. For the first 5 months of life, no significant difference was found between TD infants and infants with DS. Nevertheless, the duration of utterances increases until 8 months of age and then decreases until 12 months of age, continuously diverging between TD and DS groups (Lynch et al., 1995a). Although the methodology is not sensitive enough for an accurate differentiation between the two studied groups, it provides a starting point in the identification of possible features that can be used for future analysis of pre-babbling vocalisations (Lynch et al., 1995a). The effectiveness of the aforementioned parameter sets (i.e. ComParE and eGeMAPS) has not yet been well researched for this early phase of verbal development. So far, there is a lack of studies applying advanced computational approaches as well as of comparative studies that would allow a verdict on their applicability (see Table 2). Deep learning approaches have been applied to pre-segmented infant audio signals from different settings (e.g. interactive settings, home recordings; Pokorny et al., 2020) to solve superficial classification tasks (e.g. infant vs. adult, canonical vs. non-canonical). However, none of these studies focused on infants with DDs in the first few months of life (Ebrahimpour et al., 2020; Warlaumont et al., 2010).
Several studies on machine learning approaches applied to vocalisations in the first year of life (pre-babbling and babbling) were identified. In the pre-babbling phase, only three studies utilised approaches beyond the manual analysis of LLDs in the assessment of vocalisations in TD infants (Table 2). To the best of our knowledge, there are no such studies in infants at risk for or with a later diagnosis of DDs. These approaches investigate the effectiveness of different neural network architectures (i.e. convolutional neural network, self-organising map and perceptron hybrid network), input features (i.e. spectrograms, waveform, parametric representation), and classification schemes (i.e. infant-directed speech vs. adult-directed speech; infant vs. adult; vocalisation vs. non-vocalisation; canonical vs. non-canonical; vocant vs. squeal vs. growl; Ebrahimpour et al., 2020; Li et al., 2021; Warlaumont et al., 2010). In contrast, in the babbling phase, a number of studies analyse verbal capacities utilising computational approaches (e.g. Pokorny et al., 2020, 2022). In general, manual analysis of LLDs such as fundamental frequency (F0) is not very common for babbling vocalisations. Spectrographic analysis is very often used only for representational purposes, e.g. to represent different syllable types (e.g. Poeppel & Assaneo, 2020). For the analysis and detection of atypical development by means of computational methods, the number of approaches described is limited (Table 2).

Discussion
Some 40 years ago, the field of early infant vocalisation study was revolutionised with new ways to assess, measure and interpret early development (Koopmans-van Beinum & van der Stelt, 1986;Oller, 1978;Papoušek, 1994;Roug et al., 1989;Stark, 1980). Since then, we have learned a lot about infant prelinguistic development and vocalisation categories. Most studies, however, focused on babbling and the emergence of first words (second half of the first year of life) whilst the pre-babbling phase (first months of life), especially in infants at elevated likelihood for or diagnosed with neurodevelopmental disorders and genetic syndromes, was less researched.
The very early phase of verbal development is mostly described through the achievement of certain milestones (e.g. phonation, cooing, expansion) or via perceptual assignment of infant vocalisations to certain types (e.g. vocant, canonical syllable). Another, albeit still rarely used, approach is the description of infant vocalisations through acoustic features (e.g. duration, mean pitch, F0). Studies have only recently focused on the investigation of quantitative changes of different vocalisation types in the first 5 months of life (Jhang & Oller, 2017; Oller et al., 2013, 2019). However, these studies have not assessed infants with developmental disorders or genetic syndromes so far. Threshold definitions, such as the canonical babbling ratio (CBR) applied in the second half of the first year of life, have, to the best of our knowledge, not yet been developed or used for types of pre-babbling vocalisations. For the later stages of development, a number of different approaches to define the onset of certain functions (e.g. canonical babbling) have been proposed, providing similar critical time periods in which milestones are achieved (Lang et al., 2021; Molemans et al., 2012; Oller, 2000). Oller and colleagues (Oller et al., 1998, 1999) reported that delayed onset of canonical babbling is a precursor to later adverse linguistic functioning. Whether precursors of atypical development may already be detected in earlier vocalisations has not yet been investigated. Further research observing typical verbal development is still needed as a basis for understanding deviant patterns and trajectories. Besides pioneering the field of perceptually evaluating infant vocalisations, Oller and colleagues were also at the forefront of proposing semi-automated recording and analytical tools for the assessment of infant vocalisations (e.g. LENA system; Oller et al., 2010).
Challenges of recording preverbal data as well as advantages of automated tools for the acquisition and analyses of acoustic features have been increasingly discussed (Pokorny et al., 2020). The aim of this article is not to discuss pros and cons of automated data acquisition approaches but to focus on whether such undertakings have been utilised in the study of infant vocalisations in the first half year of life, in typical cohorts, in individuals at elevated likelihood for DDs, or groups with DDs or pre-/perinatally diagnosed disorders.
When looking beyond behavioural observations and general perceptual evaluations of early infant vocalisations, there is a lack of computational methods that study, substantiate, and support the findings of observational studies. We found that despite the existence of thoroughly tested computational approaches for babbling vocalisations, there are no attempts to use these methods in the evaluation of pre-babbling vocalisations. These vocalisations, which are perceptually less salient than canonical babbling, have predominantly been studied through simple LLDs such as F0 and duration. Only a few studies have used more advanced computational approaches to demonstrate the applicability and value of such approaches in the field of pre-babbling vocalisations (Ebrahimpour et al., 2020; Li et al., 2021; Warlaumont et al., 2010). Besides missing analytical approaches, there is also a lack of standardisation of coding schemes and datasets, which impedes the comparability of performance between applied computational models in the field of speech-language analysis in the first 5 months of life. Additionally, the sample sizes investigated in observational and computational studies are usually small (i.e. 1-119; see Table 2). The generalisation capabilities of machine learning approaches applied to small datasets are questionable. Computational or feature-based approaches are underrepresented in studying pre-babbling vocalisations, especially in infants with NDDs (Brisson et al., 2014; Lynch et al., 1995a). To fingerprint early neurofunctional development and its deviations, we need an in-depth understanding of physiological functioning as well as of disorder-specific characteristics. Early verbal development is one domain of interest that provides clues to the integrity of the developing nervous system. Recently developed analytical tools appear well suited for analysing pre-linguistic vocalisations at pre-babbling age to enhance our insights into emerging early verbal functions.
Pioneer work is required to verify computational tools in identifying disorder-specific features in early vocalisations, which may inform future clinical diagnoses and be used for monitoring therapeutic success.

Table 1 Studies analysing pre-babbling behaviour in infants with neurodevelopmental disorders or genetic syndromes applying observational methods (ascending order)

Observational studies on pre-babbling in neurodevelopmental disorders or genetic syndromes

Table 2 Studies applying computational methods on pre-babbling behaviour in typically developing infants and on babbling behaviour in infants with neurodevelopmental disorders or genetic syndromes (ascending order)

Computational studies on pre-babbling in TD
• Atypicality was mainly related to the characteristic "timbre" and prosodic, spectral, and voice quality features in the acoustic domain
• More than 50% of the vocalisations of the infant with PSV-RTT were rated as atypical by at least one listener
• Classification of TD vs. WS and WS with ASD/ID vs. WS without ASD/ID using a decision tree with a depth of one
• The approach achieved an accuracy of 76% when classifying between infants with WS and TD infants and an accuracy of 81% when distinguishing between WS with ASD and WS without ASD based on assessed vocalisation types
• It was found that especially in the classification of WS with ASD and WS without ASD, synchronicity and reciprocity were important factors in the interactive setting

ASD autism spectrum disorder,