1 Introduction

Deep-learning architectures are becoming increasingly complex, and their required training datasets are increasingly extensive [1]. Automatic speech recognition (ASR) systems based on deep-learning models (DL-ASRs) can reach very high performance on controlled speech and have opened the way to a vast range of voice-activated applications [2,3,4]. However, these powerful models only partially disclose how they achieve speech recognition. They leave open questions such as (i) which underlying rhythmic processes are being modelled, if any, and (ii) whether their unexplained, brute-force modelling approach could be substituted by a psycho-acoustic-inspired and less hardware-demanding approach. Clarifying these aspects is essential to enhancing ASR performance and efficiency. Psycho-acoustic-inspired ASRs can indeed achieve performance comparable to DL-ASRs with far less training material (even by orders of magnitude) and lower computational time and energy expense [5]. Moreover, the effectiveness of human speech recognition strongly depends on the supra-segmental processing of speech (including rhythm).

Human speech processing has indeed been explored from multiple perspectives in recent decades. In particular, the role of rhythmic scan has attracted interest as a way to describe attentional patterns supporting acoustic information weighting in real time [6, 7]. The concept of syllable has been extensively used to describe acoustic correlates of rhythmic patterns. From a phonetic point of view, a syllable has been defined [8] as “a continuous voiced segment of speech organised around one local loudness peak, and possibly preceded and/or followed by voiceless segments”. In a more acoustic-oriented definition [9, p. 70] that considers co-articulation dynamics, the syllable consists of “a centre which has little or no obstruction to airflow and which sounds comparatively loud; before and after that centre [...] there will be a greater obstruction to airflow and/or less loud sound”. Consequently, a syllable has been described as a 100–250-ms signal segment constructed around a high-energy peak (nucleus), possibly preceded by an increasing energy slope (onset) and followed by a tail of decreasing energy (coda). Syllables have been used as the basis for theories describing how speech processing has developed in human beings over time [10], and several studies have highlighted their crucial importance in speech perception and recognition [11,12,13,14,15,16]. Syllables can indeed be perceived even if they are reduced or not actually uttered [17,18,19]. However, syllabic speech units pose difficulties in linguistics and psycho-acoustics, e.g. different language experts sometimes disagree on positioning their boundaries [20, 21].

A high-performance DL-ASR might internally reproduce linguistic information at different levels of abstraction; in particular, we argue that one of these possible representations could be related to acoustic syllables and could emerge ‘spontaneously’, to the extent that the internal vector embeddings alone could suffice to teach another automatic system to recognise these syllables. These representations could be found in the deep-learning model’s encoding layers and would form automatically while the model learns to recognise speech units and language [22]. End-to-end (E2E) models belong to this class of deep-learning models (Sect. 2). They automatically model all speech recognition mechanisms, from base speech units to the language model. Through model-explainability techniques, it is possible to verify whether the specific internal dynamics of these models resemble those of human speech perception. If these dynamics exist, they can be explored, understood, and explicitly re-embedded in the ASR model. This approach can lead to significant technological breakthroughs because a psycho-acoustics-inspired ASR model would require much less training data and simpler hardware to achieve high performance [5]. This research would thus support a more computationally accessible artificial intelligence, a relevant problem in modern technology [23, 24] (Sect. 2).

In the present paper, we use the syllable as the central acoustic unit of an investigation of end-to-end ASR models (E2E-ASRs). We analyse the internal learning processes of a single E2E-ASR architecture with three different sizes of internal encoding and decoding modules. We demonstrate that these models automatically developed, in their shallower layers, a representation of rhythmic patterns related to syllables. These patterns resembled syllabic-scale human speech perception processes that past linguistic and psycho-acoustic studies have also described [12, 13, 25,26,27,28,29,30,31,32]. To this aim, we built automatic syllable boundary detectors working on the vectors extracted from the ASR models’ internal encoding layers. These detectors allowed us to identify the layers where syllable-related information was formed and to calculate rhythmic and intensity properties of the detected boundaries.

This paper is organised as follows: Sect. 2 reports background and related work on deep-learning-based ASRs and their explainability. Section 3 describes the end-to-end ASRs used in our experiment, the syllable boundary detectors we built, and the data we used for the evaluation. Section 4 reports the performance of our syllable boundary detectors and shows that the inner layers of the probed ASRs contain syllable-related information. Finally, Sect. 5 draws the conclusions.

2 Background and related works

An E2E-ASR automatically transforms a sequence of input acoustic feature vectors (possibly raw samples) from an audio signal into a sequence of graphemes or words representing the audio transcription [3]. Conventional ASR systems usually train acoustic, pronunciation, and language models separately and require specific modelling and training of these parts. E2E-ASRs overcome the difficulties and cost-ineffectiveness of the data preparation and modelling phases of conventional systems by committing a single model to learn all parts automatically. E2E-ASRs can perform comparably to conventional systems but require far more training data [5].

Despite the great interest of academia and industry in E2E-ASRs, their usage in production environments has encountered obstacles due to practical issues like insufficient client streaming capabilities, high latency, and low multi-application-context adaptability [33]. Moreover, having all information hidden in a complex deep learning model limits the understanding of the model’s internal dynamics and the confidence in using an E2E-ASR for commercial or industrial applications. Additionally, the continuous increase in the models’ size (i.e. the number of layers and parameters) limits their execution to powerful machines only (usually residing in cloud computing infrastructures) rather than local users’ computers or small embeddable devices. The data volume and computational capacity required to train a model from scratch grow so quickly with model complexity and size that significant investments are necessary to train even one model. Today, this trend allows only a few institutions and companies to develop state-of-the-art E2E-ASRs.

This section describes the deep-learning models typically used in E2E-ASRs (Sect. 2.1). Then, it provides an overview of the methodologies used to explain the internal information representation formed in these models (Sect. 2.2).

2.1 End-to-end automatic speech recognition models

Using E2E models has been a turning point in automatic speech recognition [33]. E2E models made it possible to merge acoustic and language modelling into one system whose task is to convert an input vector sequence into another (Fig. 1). Today, E2E-ASRs are frequently based on Transformer deep-learning architectures [22, 34]. A Transformer processes sequences of acoustic data vectors and automatically models the information in the vector sequence as a whole and the vectors’ inter-dependencies. As a result, it automatically infers speech units and the language model. Transformers commonly use an encoder–decoder architecture: the encoder forms an internal representation of the raw input (e.g. speech-unit-related data) that also contains information on inter-vector relations. A sequence of encoding layers is usually adopted to create increasingly abstract data representations. The decoder translates the encoded data into an output vector sequence. A sequence of decoding layers can be used to refine the output sequence iteratively. The final output of a Transformer-based ASR model is the phonetic or word transcription of the audio input. Transformer-based ASRs require a far more extensive training set than conventional ASR systems to achieve comparable performance [35,36,37]. A Transformer ASR model typically includes a self-attention module in its architecture [38,39,40]. Self-attention estimates the influence of all preceding and subsequent input data vectors when processing each data vector. This mechanism was introduced to mimic human cognitive attention because it relates one data vector in the input sequence to all its contextual data vectors. Through reverse engineering or deep-layer probing, it is possible to analyse the outputs of each encoding, decoding, and self-attention layer to understand whether these reflect perception-related processes [41,42,43,44].
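As an illustration of the mechanism described above, the following minimal sketch (written in PyTorch for clarity, not taken from the probed models) computes scaled dot-product self-attention, where each output frame is a weighted combination of all input frames and the attention weights quantify the inter-vector influences; the projection matrices are hypothetical placeholders.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Minimal scaled dot-product self-attention over a frame sequence.

    x: (frames, d_model) acoustic data vectors; w_q/w_k/w_v: (d_model, d_model)
    projection matrices (hypothetical, for illustration only).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(q.size(-1))   # pairwise frame-to-frame influence
    weights = torch.softmax(scores, dim=-1)    # each row sums to 1
    return weights @ v, weights                # contextualised frames + attention map

# usage: x = torch.randn(50, 256); w = [torch.randn(256, 256) for _ in range(3)]
#        contextualised, attention_map = self_attention(x, *w)
```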

Fig. 1
figure 1

Example architectural schema of an end-to-end automatic speech recognition model

The Transformer encoder can be implemented as a sequence of ‘Conformer’ blocks [45] (Fig. 2), each combining a sequence of four modules (a Feed-Forward network, a self-attention module, a convolution module, and a second Feed-Forward network) with a final normalisation layer. The name ‘Conformer’ is commonly used to indicate a Transformer with this encoding method. The Transformer decoder can be substituted by (or combined with) a Connectionist Temporal Classification (CTC) model [35] or a Recurrent Neural Network Transducer (RNN-T) [46]. CTC is a non-auto-regressive speech transcription technique which collapses consecutive, identical transcription labels (character, word piece, etc.) into one label unless a special blank label separates them. The result is a sequence of labels shorter than or equal in length to the input vector sequence. CTC is one of the most widely used decoding techniques. Being non-auto-regressive, it is also considered computationally efficient because it requires less time and fewer resources for the training and inference phases. Conversely, the RNN-T (also named Transducer) is an auto-regressive speech transcription technique that overcomes CTC’s limitations: it can produce transcription-label sequences longer than the input vector sequence and models long-term inter-dependencies between transcription elements. A Transducer typically comprises two sub-decoding modules: one that forecasts the next transcription label based on the previous transcriptions (prediction network) and another that combines the encoder and prediction-network outputs to produce a new transcription label (joiner network). These features improve transcription speed and performance with respect to CTC at the expense of requiring more training and computational resources [46].
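As a concrete illustration of the CTC collapsing rule described above (a generic sketch, not tied to any specific toolkit), repeated labels are merged and blank labels are then removed:

```python
def ctc_collapse(labels, blank="_"):
    """Apply the CTC collapsing rule: merge consecutive identical labels,
    then drop the special blank label that separates genuine repetitions."""
    collapsed = []
    previous = None
    for label in labels:
        if label != previous and label != blank:
            collapsed.append(label)
        previous = label
    return collapsed

# e.g. ctc_collapse(["h", "h", "_", "e", "l", "l", "_", "l", "o"])
#      returns ["h", "e", "l", "l", "o"]
```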

Fig. 2
figure 2

Typical architectural schema of a Conformer block within a generic deep-learning model’s encoder

2.2 Model explainability

Among the drawbacks of modern deep learning systems, the most frequently cited are the low accessibility of sufficient training corpora, the high demand for computational resources, and their poor interpretability (i.e. the difficulty of explaining, understanding, and trusting their decisions and outputs) [47,48,49,50,51,52,53,54,55,56,57]. DL-ASRs are not exempt from these issues [58, 59]. However, the internal model dynamics and overall ‘behaviour’ can be studied through model-output backtracking or simulated via explainable methods [60,61,62,63,64,65]. Alternatively, ‘probes’ can be installed on the encoding (and/or decoding) layers at different ‘depths’ of the layer sequence [66, 67]. The probes allow observing and then analysing the vectors produced by the layers (emissions or emission vectors) to classify some phenomenon and, consequently, characterise the information contained in these vectors [68]. For example, in computer vision, model probing allows associating focus areas or abstraction patterns with specific deep neural network layers [69].
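In PyTorch-based models, such probes can be realised with forward hooks; the following minimal sketch (a generic illustration, not the exact procedure used in the cited studies) captures the emission vectors of an arbitrary layer for later analysis. The layer path in the usage comment is hypothetical.

```python
import torch

def attach_probe(layer):
    """Register a forward hook on `layer` and collect its emissions."""
    emissions = []

    def hook(module, inputs, output):
        # some layers return tuples; keep only the main output tensor
        tensor = output[0] if isinstance(output, tuple) else output
        emissions.append(tensor.detach().cpu())

    handle = layer.register_forward_hook(hook)
    return emissions, handle

# usage (hypothetical layer path): emissions, handle = attach_probe(model.encoder.layers[4])
# run the model on an audio batch, then call handle.remove() and analyse `emissions`
```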

Probing is often used to interpret specific E2E-ASRs’ layers [66, 67]. For example, the DeepSpeech2 E2E-ASR [70] was probed to study the differences between the models trained in English and Arabic [71]. The probed emissions were classified through a Feed-Forward neural network to show that the two models were learning specific linguistic characteristics related to articulation manner and place. Other probing studies have analysed, through classification and measurements, how accent influences the DeepSpeech2 performance [72, 73]. These studies also showed how the contextual phonetic information contained in the emissions influenced the classification tasks. Probing has also been used to investigate the multi-temporal modelling of phonetic information in the Wav2Vec2.0 E2E-ASR [74, 75]. A recent study [76] has proposed an in-depth analysis of the layer-wise encoded information of the pre-trained Wav2Vec2.0 large and small-sized models. The study showed the presence of phone-level information at layers 11–12/17–18 and word-level information at layers 11–19 for the large-sized model. Some studies have also proposed a spectrogram-like representation of emissions that could be used for speaker identification and speech synthesis [77].

Probing studies have seldom analysed the possible multi-scale (e.g. phonetic, syllabic, word) and supra-segmental (e.g. rhythm, pitch) modelling occurring in E2E-ASRs, e.g. the presence of syllabic-scale or rhythmic components that also exist in human speech recognition [78]. The existence of these components would indicate the presence of interpretable information in the emissions, which could adequately be re-embedded in the ASR model to improve performance while decreasing computational complexity [5]. For example, syllabic information can be primary in data pre-processing for efficiently selecting informative data from large corpora, which would improve ASR performance and make it comparable to a system using a much larger amount of data [79]. When used within a maximum-entropy principle for acoustic feature selection and uncertainty quantification [80], syllabic information can help choose training utterances that contribute to homogenising the information distribution across speech units [81, 82]. Moreover, syllabic spectral analysis can reveal syllabic structural changes related to language evolution and anthropological dynamics [83, 84]. Finally, syllables are often central speech units in the design of ASRs targeting under-resourced languages or limited-vocabulary applications [85,86,87,88,89,90].

3 Materials and method

This section describes our probing of three state-of-the-art Transformer-based ASR models. Rather than searching for word or phonetic scale representations forming within the Transformer encoding layers, we focused on syllable-scale representations. We investigated whether the emissions of the three analysed Transformers were valuable for building a high-performance acoustic-syllable boundary detector. Moreover, we explored whether specific Transformer layers formed syllable-related representations.

The present section is organised as follows: Sect. 3.1 describes the base Transformer ASR models analysed. Section 3.2 describes the syllable boundary detectors we built for the probing task. Finally, Sect. 3.3 describes the data used for training and testing the syllable boundary detectors.

3.1 Transformer ASR models

The Transformer ASR model architecture we probed was a Conformer model from the Nvidia NeMo Automatic Speech Recognition toolkit [91]. Nvidia distributes three pre-trained versions of this ASR model with different ‘sizes’ (corresponding to different Conformer block sequence lengths), trained on the NeMo ASRSET-2.0 open-source corpus in English (Table 1). The NeMo toolkit and the pre-trained models aim to provide academic and industrial researchers with state-of-the-art tools to build conversational agents.

Table 1 Summary of the three transformer-based automatic speech recognisers used as the basis of our experiment

The Conformers used contain a sequence of Conformer blocks in their encoder module (encoding layers). The decoder uses a Transducer for word-based decoding or, alternatively, a CTC model for character-based decoding. For the present study, we used word-based decoding because we addressed the detection of speech units at a larger scale than the phonetic one (which roughly corresponds to character-based decoding). Table 2 reports the number of encoding and decoding layers across the pre-trained Conformers.

Table 2 Number of layers and neurons-per-network in the encoder and decoder modules of the three Transformer Automatic Speech Recognisers probed

In the present experiment, we probed the encoding layers’ emissions of the three Nvidia NeMo Conformers (hereafter generally indicated as Transformer ASR models) to search for evidence of syllable-related information automatically forming in these layers.
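A minimal sketch of how such pre-trained NeMo Conformer Transducer models can be loaded and run is given below; the checkpoint names and the transcription call are indicative and may differ across NeMo releases, so they should be treated as assumptions rather than the exact setup used here.

```python
# Assumes the NeMo toolkit (nemo_toolkit[asr]) is installed; checkpoint names are indicative.
import nemo.collections.asr as nemo_asr

CHECKPOINTS = {
    "small": "stt_en_conformer_transducer_small",
    "medium": "stt_en_conformer_transducer_medium",
    "large": "stt_en_conformer_transducer_large",
}

# Load the three pre-trained Conformer Transducer models (word-based decoding).
models = {
    size: nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name=name)
    for size, name in CHECKPOINTS.items()
}

# Produce word-level transcriptions of the probing corpus (file paths are placeholders).
hypotheses = models["small"].transcribe(["utterance_0001.wav", "utterance_0002.wav"])
```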

3.2 Syllable boundary detection

To investigate whether the Transformer ASR models internally formed a rhythmic or syllabic-scale component in specific encoding layers, we trained new machine-learning models with the emission vectors of different encoding layers. Each model was trained to classify one emission vector from one encoding layer as corresponding to syllabic boundary presence or absence. We trained one detector for each encoding layer and Transformer ASR model size to verify (i) whether specific encoding layers contained sufficient information for syllable boundary detection and (ii) whether the correct detections could be associated with long and intense (prominent) syllables, which are integral to human speech recognition [11,12,13,14,15,16].

We used new training and test material to build the syllable boundary detectors (Sect. 3.3). The Transformer ASR models were first executed on the training set audio files to produce word-level transcriptions. Probing was conducted by acquiring tensors (later flattened into vectors) at the output of one encoding layer at a time. Specifically, all emission vectors \(\{e_{l,m}\}\) (of length h) obtained by probing the l-th layer (between 1 and 16 or 17) of Transformer m (with m among Small, Medium, and Large) were saved as the training data for a new syllable boundary detection model.
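The following sketch illustrates one plausible way to pair the saved emission vectors with boundary labels, assuming one emission vector every 40 ms (the emission rate of the probed models, as noted below) and a list of manually annotated boundary instants; the exact alignment convention used in our pipeline may differ.

```python
import numpy as np

FRAME_STEP = 0.04  # seconds; the probed Transformer ASR models emit one vector every 40 ms

def frame_boundary_labels(num_frames, boundary_times):
    """Label each 40-ms emission frame with syllable-boundary presence (1) or absence (0).

    num_frames:     number of emission vectors e_{l,m} extracted for the utterance
    boundary_times: annotated syllable-boundary instants, in seconds
    """
    labels = np.zeros(num_frames, dtype=np.int64)
    for t in boundary_times:
        idx = int(t / FRAME_STEP)   # frame containing the annotated boundary
        if idx < num_frames:
            labels[idx] = 1
    return labels

# usage: labels = frame_boundary_labels(len(emissions), [0.12, 0.31, 0.55])
```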

The syllable boundary detection model was a Long Short-Term Memory (LSTM) model followed by a binary-classifying Feed-Forward artificial neural network (Fig. 3). LSTMs are well suited to classifying time series of observation vectors [95] and to making predictions from historical data [96,97,98,99]. In the conventional architecture, an LSTM consists of one computational unit that iteratively processes all input time series vectors. This unit comprises three gates that process one vector at a time while combining this vector with information extracted from the previously processed vectors. All gates are realised as one-layer Feed-Forward neural networks with the same number of output neurons (hidden-layer length, n) and tanh or sigmoid activation functions. The gates’ outputs are further processed by an output gate that produces an output vector of size n for the input vector processed at time t. The LSTM hidden-layer length is the crucial model parameter to tune for achieving optimal classification performance. Our LSTM processed a sequence of \(\{e_{l,m}\}\) emission vectors (each of length h) and produced a new sequence of vectors of size n. The two sequences were aligned over time. For each time step t, the Feed-Forward network produced a binary decision for syllable boundary presence (1) or absence (0) based on the LSTM hidden-layer output. In summary, we trained and tested different LSTM-based syllabic boundary detectors (\(L_{n,m,l}\)) for all possible n, m, and l combinations and studied the models’ performance while searching for evidence of syllable-related properties in the models’ decisions. To reduce overfitting risk, we also enabled a dropout neuron-selection strategy for the LSTM gates, which randomly excluded each neuron and its weights with a 0.2 probability at each training iteration [100]. Notably, our syllabic boundary detectors’ temporal sensitivity (the minimum difference between consecutive time steps) was 40 ms because all Transformer ASR models produced emissions at this rate.

Fig. 3
figure 3

Schema of our syllable boundary detector: t indicates the temporal index of the currently analysed frame; n is the LSTM hidden-layer length; h is the probed Transformer ASR model’s emission-vector length; d is the binary decision (0/1) produced by a classification feed-forward artificial neural network
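A minimal PyTorch sketch of the detector in Fig. 3 is shown below; it follows the description above (per-frame binary classification over LSTM outputs), but the gate-level dropout is approximated here by a standard dropout layer applied to the LSTM outputs, so it should be read as an assumption-laden illustration rather than the exact implementation.

```python
import torch
import torch.nn as nn

class SyllableBoundaryDetector(nn.Module):
    """LSTM over a sequence of emission vectors e_{l,m}, followed by a
    per-frame binary feed-forward classifier (boundary presence/absence)."""

    def __init__(self, emission_dim: int, hidden_dim: int = 320, p_drop: float = 0.2):
        super().__init__()
        self.lstm = nn.LSTM(emission_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(p_drop)            # stand-in for the gate-level dropout
        self.classifier = nn.Linear(hidden_dim, 2)   # logits for absence (0) / presence (1)

    def forward(self, emissions):                    # emissions: (batch, frames, emission_dim)
        hidden, _ = self.lstm(emissions)             # (batch, frames, hidden_dim), time-aligned
        logits = self.classifier(self.dropout(hidden))
        return logits                                # one binary decision per 40-ms frame

# usage: detector = SyllableBoundaryDetector(emission_dim=512, hidden_dim=320)
#        decisions = detector(torch.randn(1, 100, 512)).argmax(dim=-1)
```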

3.3 Experimental data

For training and testing the syllable boundary detectors, we used a dataset annotated by Italian and Spanish experts, available to the members and customers of the CLARIN research infrastructure [101]. The Italian corpus contained 68 wave files from 11 speakers recorded at a 16-kHz sampling frequency, for a total of 3.5 min of annotated audio. The Spanish corpus contained 45 recordings from 6 speakers, for a total of 2.8 min of annotated audio.

Annotations were available in TextGrid format [102] and contained the following annotation levels:

  • Word: the word-by-word orthographic transcription of the speech signal;

  • Syllable-phonetic: the phonetic pronunciation of the uttered syllables;

  • Syllable-phonologic: the expected syllable transcription according to the word-level transcription and the phonologically predictable reduction processes.

These three annotation levels did not necessarily correspond to synchronised boundaries because they were produced independently of each other, and perceptual differences exist in the human recognition of the different levels [15]. In our experiment, we used the syllable-phonetic level to detect acoustic-related syllable boundaries. Acoustic-related syllables are indeed the only syllable type an ASR model can extract from the raw audio signal without using an externally provided linguistic model. Additionally, merging the Italian and Spanish corpora was justified because these languages belong to the same linguistic family and have similar syllable boundary definitions. The merged corpus allowed us to produce statistically significant results.
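The boundary instants of the syllable-phonetic level can be read from the TextGrid files, for instance as sketched below with the `textgrid` Python package; the package choice and the tier-name matching are assumptions, since the actual tier names in the corpus may differ.

```python
# Assumes the `textgrid` package (pip install textgrid); the tier-name filter is hypothetical.
import textgrid

def syllable_phonetic_boundaries(textgrid_path):
    """Return the end times (seconds) of the non-empty intervals in the
    syllable-phonetic tier, used as syllable-boundary instants."""
    tg = textgrid.TextGrid.fromFile(textgrid_path)
    tier = next(t for t in tg if "syll" in t.name.lower())   # pick the syllable tier
    return [interval.maxTime for interval in tier if interval.mark.strip()]

# usage: boundaries = syllable_phonetic_boundaries("speaker01_utt03.TextGrid")
```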

We used the Italian-Spanish merged corpus to train and test our LSTM-based syllable boundary detectors. We split the corpus into train, validation, and test sets with 60%, 20%, and 20% proportions while ensuring that these sets did not share speakers. This choice aimed to guarantee that the results were mostly speaker-independent. In the data preparation phase, we associated the Transformer ASR models’ emissions with syllable boundary presence or absence and then used this association for model development and testing. Therefore, we prepared separate vector datasets for each probed emission layer of each Transformer ASR model. Based on these data, we conducted a two-step analysis: first, we identified the three most promising parametrisations of the LSTM-based syllable boundary detectors. These models were selected for having very different LSTM hidden-layer lengths while achieving comparably high performance on the validation data. They allowed us to study performance variation across different resolutions of emission encoding and processing in the LSTMs. Second, we compared the syllable boundary detectors’ performance across the Transformer ASR models and encoding layers. We also tested whether syllabic information detection was independent of the emission encoding resolution in the LSTM.
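One possible way to implement such a speaker-disjoint 60/20/20 split is sketched below with scikit-learn’s GroupShuffleSplit; the file-to-speaker mapping is a hypothetical input, and the actual split procedure we used may differ.

```python
# Assumes scikit-learn; `files` and `speakers` are parallel lists (hypothetical inputs).
from sklearn.model_selection import GroupShuffleSplit

def speaker_disjoint_split(files, speakers, seed=0):
    """Split files into ~60/20/20 train/validation/test sets with no shared speakers."""
    outer = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=seed)
    train_idx, rest_idx = next(outer.split(files, groups=speakers))

    rest_files = [files[i] for i in rest_idx]
    rest_speakers = [speakers[i] for i in rest_idx]
    inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
    val_idx, test_idx = next(inner.split(rest_files, groups=rest_speakers))

    return ([files[i] for i in train_idx],
            [rest_files[i] for i in val_idx],
            [rest_files[i] for i in test_idx])
```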

Using Italian and Spanish corpora to train and test the syllable boundary detectors—although the Transformer ASR models were originally trained in English—was a reasoned experimental choice. Indeed, our target was to study rhythmic, syllable-related information rather than syllable recognition. Rhythm is a language-independent feature, whereas syllable recognition is a language-dependent one [103]. Therefore, using languages other than English in the probing task allowed us to study the language-independent features of the syllabic and rhythmic components modelled by the Transformer ASR models.

4 Results

This section reports the performance measurements of our LSTM-based syllable boundary detectors across the probed encoding layers of the Transformer ASR models. In particular, Sect. 4.1 reports standard evaluation metrics for detecting the optimal LSTM parametrisation per Transformer ASR model, and Sect. 4.2 reports a syllable-oriented acoustic characterisation of the detected syllable boundaries.

4.1 Error rate and statistical significance

4.1.1 Considered metrics and measurements

We evaluated all possible \(L_{n,m,l}\) models to identify the optimal LSTM hidden-layer length (n) and encoding layer depth (l) per Transformer ASR model size (m). The three best-performing configurations with sufficiently different n had hidden-layer lengths of 160, 320, and 740. We compared these configurations across all m and l combinations, for a total of 172 L models trained and tested.

We used the SCTK evaluation suite of the National Institute of Standards and Technology (NIST), a commonly used reference tool, to measure the \(L_{n,m,l}\) models’ performance. In particular, in compliance with other syllable boundary detectors [104,105,106], we measured the model Word Correct Rate (WCR) as the fraction of correctly classified words (i.e. \(\textrm{WCR}=\frac{{\text {Number of correctly classified words}}}{{\text {Total words}}}\)). The Word Error Rate metric, commonly used by other works, corresponds to \(1-\textrm{WCR}\). In our experiments, a ‘word’ corresponds to a syllable-boundary label indicating presence or absence in a 40-ms segment.

4.1.2 Evaluation

Figure 4a–c reports the WCR charts grouped by LSTM hidden-layer length. The x-axis indicates the probed Transformer ASR model’s layer depth index, and the colours indicate the three Transformer ASR models analysed. The y-axis reports the WCR. For example, in Fig. 4a, \(x=0\) compares the syllable boundary detectors with \(n=160\) trained on the emissions extracted from the first layer (\(l=1\)) of the Small, Medium, and Large Transformer ASR models separately (i.e. for all m values).

The general trend emerging from the charts is that the shallower encoding layers contained more discriminant information for syllable boundary detection, and this information decreased in deeper layers. Layers with depth indexes between 3 and 6 contained the most valuable information, with the 4th and 5th depth-index layers being the most informative. This observation was valid across all Transformer ASR models. The detectors’ WCRs became more similar across the Transformer ASR models as the LSTM hidden-layer length increased. This observation indicates similar information encoding in ‘long’ LSTMs, which compensated for smaller Transformer ASR model sizes (Fig. 4c).

Fig. 4
figure 4

Word correct rate of our LSTM-based syllable boundary detectors across the emission vectors extracted from three Transformer ASR models (small, medium, and large). The x-axis reports the depth of the Transformer ASR model layer from which vectors were extracted. The three charts correspond to different LSTM hidden-layer lengths, i.e. a 160, b 320, and c 740

We also tested the statistical significance of the measured performance differences. After fixing m and n, we cross-compared the \((L_{n,m,l_i}, L_{n,m,l_j})\) WCRs for all i and j layers (with \(i\ne j\)). Significance tests were two-tailed tests with the null hypothesis of no WCR difference. For example, Table 3 reports all significance tests for a syllable boundary detector with a 320 hidden-layer length using the emissions of the Small Transformer ASR model (all other tables are reported in Appendix). Columns Sys 1 and Sys 2 indicate the \(l_i\) and \(l_j\) indices. The Win column indicates which detector achieved the highest performance. The Relevance column reports the minimum significance level at which the difference holds: ‘*’ (\(p=0.001\)), ‘**’ (\(p=0.01\)), ‘***’ (\(p=0.05\)), or non-significant (empty). Therefore, the most significant discrepancies were those indicated with one ‘*’. The table demonstrates that the LSTM with a 320 hidden-layer length achieved the highest performance using the emissions of the 4th-index encoding layer of the Small Transformer ASR model. This performance was significantly higher than the one achieved using the other encoding layers. The comparisons across all n and m values confirmed that the 4th and 5th encoding-layer indexes of the Transformer ASR models always corresponded to the highest and most significant WCRs.

Table 3 Summary of the pairwise statistical significance tests between LSTM-based syllable boundary detectors with a 320 hidden-layer length, trained on the feature vectors extracted from the Small Transformer ASR model
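SCTK implements its own matched-pair significance tests; as a loose, illustrative stand-in (not the tool used for the reported results), a two-tailed McNemar test on the paired per-segment correctness of two detectors could be computed as follows.

```python
# Illustrative only: assumes statsmodels; SCTK's sc_stats was used for the reported results.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def paired_significance(correct_a, correct_b):
    """Two-tailed McNemar test on boolean per-segment correctness of two detectors."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    table = [[np.sum(a & b), np.sum(a & ~b)],
             [np.sum(~a & b), np.sum(~a & ~b)]]
    return mcnemar(table, exact=True).pvalue

# usage: p = paired_significance(reference == pred_layer4, reference == pred_layer5)
```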

4.2 Optimal model identification and energy-pitch characterisation of the classifications

4.2.1 Considered metrics and measurements

We measured the overall performance of the \(L_{n,m,l}\) models after fixing l to the most informative emission layer for syllable boundary detection per (n, m) pair. We used standard measurements (Accuracy, Precision, Recall, F1) based on the experts’ corpus annotations. We also used Cohen’s kappa to measure the agreement between the manual and automatic annotations with respect to the chance agreement. In this comparison, true positives (TPs) were 40-ms segments where both the manual and automatic annotations indicated the presence of a syllabic boundary. Likewise, true negatives (TNs) were segments where both annotations indicated the absence of a syllabic boundary. False negatives (FNs) were segments where only the manual annotation indicated a syllabic-boundary presence. Finally, false positives (FPs) were segments where only the automatic annotation indicated a syllabic-boundary presence. In summary, the following performance measurements were used:

$$\begin{aligned} \textrm{Accuracy}&= \frac{\textrm{TP}+\textrm{TN}}{\textrm{TP}+\textrm{TN}+\textrm{FP}+\textrm{FN}}\\ \textrm{Precision}&= \frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}}\\ \textrm{Recall}&= \frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}}\\ \textrm{F}1&= 2 \cdot \frac{\textrm{Precision} \cdot \textrm{Recall}}{(\textrm{Precision}+\textrm{Recall})} \end{aligned}$$
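These frame-level measurements, together with Cohen’s kappa, can be computed directly from the reference and predicted 0/1 labels, for example with scikit-learn (a sketch of the evaluation step, assuming the labels are available as arrays):

```python
# Assumes scikit-learn; `reference` and `predicted` are 0/1 labels per 40-ms segment.
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             precision_score, recall_score)

def evaluate_detector(reference, predicted):
    """Frame-level Accuracy, Precision, Recall, F1, and Cohen's kappa."""
    return {
        "Accuracy": accuracy_score(reference, predicted),
        "Precision": precision_score(reference, predicted),
        "Recall": recall_score(reference, predicted),
        "F1": f1_score(reference, predicted),
        "Kappa": cohen_kappa_score(reference, predicted),
    }
```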

4.2.2 Evaluation

We used Accuracy and F1 as the principal measurements to identify the optimal model because Accuracy calculates the overall fraction of correctly detected boundaries, and F1 summarises Precision and Recall through their harmonic mean. Generally, high Accuracy was measured for all detectors, i.e. they could extract valuable syllable boundary-related information. The assessment indicated that the overall optimal model was an LSTM-based model with a 320 hidden-layer length, operating on the output of the 4th layer index of the Small Transformer ASR model (Table 4). The kappa agreement was “good” according to Fleiss’ classifications [107] for all detectors but was slightly better for the optimal model (0.54). The optimal model achieved a lower Recall than the other models because of a higher number of false negatives. However, the model compensated for the Recall loss with a higher Precision, resulting in a higher F1 overall.

Table 4 Summary of the performance of our syllable boundary detectors reported per LSTM hidden-layer length. Each row reports the corresponding Transformer ASR model used and the optimal encoding layer used for feature extraction. Red numbers indicate the highest values for each measurement. The overall optimal model is highlighted in green

As an additional step, we characterised the optimal model’s classification categories (TP–TN–FP–FN) over 40-ms segments by studying their average energy and pitch-level distributions. Energy is here intended as the sum of the squared signal-segment samples divided by the total number of segment samples (signal-segment power). Pitch, a rhythm-related feature, was estimated as the average pitch of 10-ms windows within the 40-ms classified segments. It was calculated through Boersma’s sound-to-pitch algorithm [108] within a 60–250-Hz frequency band. The energy and pitch distributions across the classification categories allowed for characterising specific and shared properties of these categories (Table 5 and Fig. 5). Generally, the TPs corresponded to segments with higher energy than FPs (+ 31%) and TNs (+ 14%) but had slightly lower energy than FNs (− 6%) (Table 5). Higher energy (+ 25%) was also observable for expert-annotated syllabic boundaries (corresponding to TP + FN) compared to non-annotated segments (TN + FP). Conversely, TPs corresponded, on average, to a lower pitch than FPs (− 1%) but a higher pitch than TNs (+ 18%) and FNs (+ 7%). However, the experts’ annotations presented a higher average pitch (+ 5%) in the syllabic-boundary segments than in non-syllabic-boundary segments. Notably, TPs fell in energy islands (signal segments characterised by an increasing onset, a nucleus, and a decreasing coda) of syllabic scale (100–200 ms) and, on average, with double the duration of the TNs’ energy islands (40–100 ms).
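A sketch of these two measurements is given below; it assumes the `parselmouth` Python wrapper around Praat as one possible implementation of Boersma’s sound-to-pitch algorithm (the specific tool used in our analysis is not prescribed here), with the 60–250-Hz band and 10-ms pitch frames described above.

```python
# Assumes numpy and parselmouth (pip install praat-parselmouth); paths and times are placeholders.
import numpy as np
import parselmouth

def segment_energy(samples):
    """Signal-segment power: sum of squared samples divided by the number of samples."""
    samples = np.asarray(samples, dtype=np.float64)
    return float(np.sum(samples ** 2) / len(samples))

def segment_pitch(wav_path, t_start, t_end):
    """Average pitch (Hz) of the 10-ms pitch frames falling inside a classified segment."""
    sound = parselmouth.Sound(wav_path)
    pitch = sound.to_pitch(time_step=0.01, pitch_floor=60.0, pitch_ceiling=250.0)
    times = pitch.xs()
    f0 = pitch.selected_array["frequency"]
    mask = (times >= t_start) & (times < t_end) & (f0 > 0)   # unvoiced frames are reported as 0 Hz
    return float(np.mean(f0[mask])) if mask.any() else float("nan")
```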

Table 5 Proportions and relative variations of average energy and pitch in 40-ms length audio segments

A range of energy values over 34.4E5 mainly corresponded to TPs (Fig. 5a), which would reinforce the classification confidence of these segments as syllabic boundaries, should energy be used as a weighting classification factor. FNs presented moderately high energy (11.48E5 median) and pitch (124.29 Hz median and 157.53 Hz at the 75th percentile) (Fig. 5b). The high pitch of FNs was a distinctive characteristic compared to TNs (119.66 Hz median value and 147.70 Hz at the 75th percentile) that would allow for automatically revising the classification of non-syllabic boundaries. As for FPs, the corresponding segments presented a median energy comparable with the TNs’ energy (9.99E5 vs 9.64E5) but had lower median energy than the FNs (11.48E5). Therefore, energy was not a discriminant property of FPs. However, the FPs presented a generally higher pitch than TNs (129.64 Hz vs 119.66 Hz), which could help detect and correct some FPs.

Fig. 5
figure 5

Box plots displaying the distributions of a energy and b pitch values across true positive (TP), true negative (TN), false positive (FP), and false negative (FN) classifications

5 Discussion and conclusions

This paper has described a probing experiment for end-to-end Transformer ASR models based on automatic syllable boundary detection. Our goal was to verify whether such architectures internally modelled a rhythmic component similar to the one humans appear to use while processing speech. Syllable boundary detection was based on an LSTM processing the feature vectors extracted from a Transformer ASR model’s encoding layer. The most informative vectors were produced by the smallest-size Transformer ASR model and were best exploited by an LSTM with a 320 hidden-state length (medium size). Our syllable boundary detector also reached a higher accuracy (\(\sim\) 87%) than alternative systems for Italian [104, 109, 110].

One significant result of the present study is that our syllable boundary detectors’ performance depended on a rhythmic component modelled by the inner layers of the analysed Transformers, correlated with psycho-acoustic syllables. In fact, our evaluation highlighted that an acoustic component with high energy and long duration was primarily contained in the Transformers’ shallower encoding layers (\(\sim\) 4), fading out in deeper layers (\(\sim\) 16), and was valuable for automatic syllable boundary detection. This result suggests that the Transformer ASR models captured syllable separation (and, consequently, rhythm) in the earliest stage of the encoding process, in agreement with studies that have explored automatic and human speech-processing similarities from a medical perspective [111]. It also aligns with other studies [76] that detected phone-level and word-level positive reactions in layers likely compatible with those we detected as reacting to syllables. A detailed analysis of the optimal syllable boundary detector’s output indicated that the true-positive classifications were associated with highly energetic boundaries within syllabic-scale energy islands (100–200 ms) having double the duration of the true negatives’ energy islands. This observation indicates a correlation between the detected syllable boundaries and syllabic prominence [15]. The wrongly classified boundaries (false positives) had medium–low energy profiles similar to those of true negatives but a higher average pitch. Therefore, a high-pitch and medium–low-energy segment should lower the syllable boundary detection confidence, whereas a high-energy segment should increase the detection confidence [104, 110].

Among the missed boundaries (false negatives), a subset was characterised by higher energy and pitch than true negatives. These cases might correspond to the boundaries of stressed syllables at the end of words (present in Italian and Spanish). Therefore, they could be due to the discrepancy between the Transformer ASR model and the syllable detector training languages. One point of discussion is indeed the consequence of training the Transformer in English and the syllable boundary detector in Italian and Spanish. The representation formed in the Transformer’s shallower layers was related to English syllables, i.e. to the specific energy, length, and pitch profiles of a stress-timed language. Conversely, Italian and Spanish are syllable-timed languages. This discrepancy mainly increased the number of false negatives, although not enough to compromise the overall performance. The underlying reason was likely that the emissions contained an important rhythmic component that was language-independent.

In this work, we have focused on syllable units rather than phonemes or words since the size of the considered analysis windows does not allow for capturing phoneme-related characteristics [76]. On the other hand, the probed encoding layers do not contain word-level information. In the future, we will consider finer- and coarser-grained units for analysis in more extended models. Moreover, we will explore how performance might change when all training sets belong to the same language. We will also study whether, in these conditions, true positives mostly correspond to long and intense syllables (i.e. to syllabic acoustic prominence). Having a way to detect prominent syllables would be crucial to improving syllabic ASR models’ performance while drastically reducing the training set dimension [5] and would help refine the perceptive and acoustic definition of syllable [15].

Our results create an interesting parallel between human speech recognition, relying on psycho-acoustic syllable-related units, and DL-ASR internal processing. They provide insight into, and a location for, human-explainable processes inside E2E-ASR systems related to the formation of syllabic-scale unit representations. Other scientific studies have also conjectured that the internal knowledge representation formed in deep learning models can produce new definitions of speech units and emerging dynamics similar to the human brain’s internal speech representations [112, 113]. However, it is difficult to understand the influence of the spontaneously formed speech units on the E2E-ASR performance due to the large number of parameters, the training material, and the diverse training methodologies used [22, 114]. One common research question in this context is whether we can learn from the psycho-acoustic-like dynamics in ASRs to enhance other ASRs’ performance and efficiency. This question has been investigated in under-represented languages, limited-vocabulary ASRs, and robust spontaneous speech recognition in noisy environments [13, 22, 28, 115,116,117]. The question gains more interest if we express the gap between conventional ASRs (which use explicit speech-unit and language modelling) and E2E-ASRs in terms of the number of parameters used by the two system types. For example, the Wav2Vec2.0 E2E-ASR requires \(10^8\) parameters to overcome the performance of a conventional ASR (with \(10^2\) parameters) on a large-vocabulary recognition task [114]. Its performance is lower than conventional ASRs’ performance when using \(10^7\) parameters. The combination of low complexity and high performance in conventional ASRs is due to their explicit modelling approach. However, new solutions might exist in the \(10^6\) gap of used parameters, which could rely on deep learning architectures using information from conventional ASR modelling [5]. Multi-channel E2E-ASRs are exploring this possibility by injecting supra-segmental and non-verbal characteristics into encoding layers to enhance noisy-speech recognition [118, 119]. They usually represent these characteristics as additional (latent) variables or pre-trained sub-models [120, 121]. Other approaches use the vectors extracted from the hidden layers of large E2E-ASRs (distilled features) to train smaller ASRs and achieve higher performance on specific tasks [122]. Conversely, other systems use distilled features instead of standard acoustic features to improve the performance of conventional ASRs [33, 123, 124].

Our experiment identified a particular type of distilled features related to rhythm and syllables that can be used in other ASRs. These features are suitable for Few-shot Learning, i.e. to make an ASR model generalise over new data categories using limited training data [125]. Distilled features similar to the ones we detected have indeed shown potential to reduce the hypothesis space, avoid overfitting, ensure heterogeneity in the prediction space, and consequently improve ASR effectiveness over the small datasets available for low-resourced languages and applications [126]. For example, they have been used as prototype vectors for internal encoding classes (e.g. speech units) to enhance class centroid representations and achieve better generalisation [127,128,129]. Moreover, they have been proposed to focus a Few-shot Learning model on islands of prominent-speech segments having high-quality pronunciation and thus being more clearly recognisable [5, 130].

Generally, the ASR performance improvement deriving from integrating syllabic-scale features has long been reported and inspired the present work. Acoustic features enriched with syllable-boundary information or syllabic-scale features can considerably improve continuous and spontaneous automatic speech recognition, especially in high-noise and reverberant scenarios [131,132,133,134]. Moreover, syllabic-scale features derived from deep learning models are critical for diagnostic systems based on prosodic information, such as those for pathological speech detection in syllable-timed languages [135, 136]. Recent studies have also highlighted the centrality of these features in contexts where prosody is the primary information source, such as infant cry detection and classification [137, 138]. The highly prosodic nature of infant cry indeed makes syllable-scale acoustic features central to these tasks, especially when extracted through deep learning models [99], and allows interpreting a newborn’s psychological and clinical status [139,140,141]. Finally, another field of application of syllabic-scale features is the improvement of ASR robustness to adversarial attacks (e.g. hidden voice commands), which requires introducing new paradigms for attack evaluation [142,143,144]. Some studies have indeed highlighted that syllabification (which can be based on syllabic-scale features, as in our case) is critical to discovering potential attacks and consequently improving ASR robustness [145, 146].

In summary, all mentioned application cases would likely benefit from syllable-related distilled features, e.g. those extracted from the 4th-index encoding layer of the Small Transformer ASR model probed in the present paper and exploited by our 320 hidden-state LSTM detector. In future experiments, we will verify this statement in all mentioned contexts.