1 Introduction

A phone is a sound of speech, and a phoneme is an abstract representation of a phone. The main difference is that phones are characterized by physical features, such as the distribution of energy in frequency bands, while phonemes have linguistic descriptions as speech elements. In other words, phones can be extracted from speech by electronic devices, while phonemes are distinguished by the sense of hearing supported by linguistic knowledge. Phones and phonemes are not uniquely assigned to one another. From the physical point of view, speech signals are strongly distorted by the speaker’s individual characteristics, such as sex, age, intonation, and emotional state. Additional distortions result from co-articulation effects, that is, the significant impact of neighbouring phones. All these phenomena strongly affect the physical properties of phone articulation. It is therefore reasonable to ask how, in spite of these distortions, speech signals are accurately analyzed by the human brain, and how electronic devices can be improved to enhance voice communication between humans and computers.

Quentin D. Atkinson suggested in his article [2] that the founder effect may operate on human languages. This means that expansion should progressively reduce phonemic diversity with increasing distance from the point of origin. His model points to central and southern Africa as the location where the first languages may have originated. Atkinson examined geographic variation in phonemes using data taken from 504 languages described in the World Atlas of Language Structures (WALS). His article [2] provoked immediate criticism (e.g. [22]) and has been cited numerous times. His opponents suggest that taking into account historical processes like migrations, conquests, and borrowings would explain language evolution more credibly than the founder effect alone. The scientific controversy about Atkinson’s hypothesis motivated us to conduct an independent study to assess his suggestions. Unlike Atkinson [2], our approach was based on analyzing phones instead of phonemes.

Every language articulation exploits only a small part of innate human abilities. Young children are able to learn a spectrum of sounds broader than those existing in any particular language. Their individual articulation abilities are shaped by the culture that motivates children to master some phones and lose the ability to produce others at the same time.

We compared phones for 239 languages spoken by about 96% of the world’s population. Our approach is an unsupervised speech research study. Unsupervised methods make it possible to analyze any language without prior linguistic knowledge. These methods try to mimic the way in which the human senses analyze speech and infants learn language by simply being exposed to it.

Most approaches to automatic partition of speech into separate units do it in two steps [14]: segmentation joined with parametrization, followed by clustering. We also used this approach.

Frequency analysis is the first step in speech processing, by people and usually also by electronic devices. Computer analysis makes it possible to partition speech signals into segments which are characterized by a relatively constant energy distribution in the frequency domain. Additionally, more precise frequency analysis makes it possible to parametrize phones. Next, these parametrizations were used to determine the probability density of phone distribution in the 11-dimensional frequency domain.

Cluster analysis aims to reveal similarities between related phones collected in a data set [10]. Groups of similar elements (frequently associated with different variations of the same speech segment) were extracted by the clusterization of phones.

For any language, neither the number of clusters which group similar phones nor the acceptable phone deviations in the frequency domain are precisely defined. These two quantities, however, are closely tied. The greater the permissible deviations within the clusters, the smaller the number of clusters identified with hypothetical phone representatives. The main goal of our publication is to present the experimentally determined dependence of the number of phone representatives on the permissible changes inside the clusters that group similar acoustic elements. The nature of this dependence is the same for all languages, although some characteristic differences are visible. We used these deviations as the basis for ranking languages.

Frequency analysis of phones allows us to calculate spectral properties in order to compare world languages. Such analysis provides information about languages from an articulation point of view [11, 18]. It is natural to expect different pronunciations between different languages. Computer analysis uses signal processing methods to find the frequency properties of speech. Phone comparison between languages brings new and sometimes unexpected conclusions. Precise analysis of multi-linguistic speech aims to provide answers to the following question: how different are phones used in different languages and what are the individual features which characterize phone distributions?

This paper consists of five main parts. Chapter 2 introduces the database used to analyze the languages of the world. The next two chapters briefly describe the frequency method for automatic extraction of phones and their parametrization. Chapter 5 provides the basics of the clusterization method. The most important part of the paper is presented in Chapter 6, where we propose two methods for language characterization. They are based on the dependence of the number of clusters on the Ward’s distance obtained during the hierarchical clustering. Chapter 7 presents the results of calculations and suggestions for how they can be interpreted. Chapter 8 concludes the paper.

2 Data acquisition

Vast volumes of speech recordings are not transcribed and do not have time annotations. Adding such annotations is an expensive and time-consuming process. Our motivation to develop a universal method for automatic extraction of phones from non-annotated speech is the need to compare phones of a vast number of languages which do not have transcribed training data corpora. Therefore, fully automatic segmentation followed by phone analysis is extremely useful.

The diversity of languages can be verified by a computer analysis of speech recordings. To analyze frequency features of languages it is necessary to collect speech samples for hundreds of languages. Gathering speech recordings of appropriate quality and length is not an easy task. Results of analysis can be relevant only if the duration of recordings for each language is sufficiently long. We have not found a database of speech recordings that was created for scientific research and contains several hundred languages of the world. The Global Recordings Network (GRN) website [6] is a source of vast volumes of language recordings. GRN is a provider of Bible audio materials in 3563 languages and dialects, making the database a vast linguistic resource. The uneven quality of recordings is a drawback, since the database was not created for scientific research. However, the recordings have been used for linguistic research into subjects such as rhythm and phonological characteristics [4], for developing and testing computer systems to recognize languages [3], and for documenting and reviving rare languages [17].

Languages were chosen for further processing based on recording length and number of native speakers. From the top 300 languages which were analyzed in [22], we selected 239 languages to enable us to compare our results with other approaches.

Language recording length makes it possible to extract at least a few thousand segments for each language, up to almost two million for English and Mandarin. To make computation feasible, the number of segments for further processing was restricted to two hundred thousand segments randomly selected from language data.
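The restriction to two hundred thousand randomly selected segments can be sketched as below. This is only an illustration under stated assumptions: the function name, the fixed seed, and sampling without replacement are our choices and are not taken from the original implementation.

```python
import random


def subsample_segments(segments, limit=200_000, seed=0):
    """Return at most `limit` segments, drawn uniformly without replacement.

    The seed is an assumption made for repeatability; the paper does not
    state how the random selection was seeded.
    """
    if len(segments) <= limit:
        return list(segments)
    rng = random.Random(seed)
    return rng.sample(segments, limit)
```

Languages with fewer segments than the limit are passed through unchanged, so only the largest collections (such as English and Mandarin) would be reduced.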

3 Segmentation

The vast majority of speech processing methods need segmentation of speech signals [5]. Uniform segmentation is used most commonly, but many studies relate to the non-uniform segmentation of: phones [8, 24], syllables [13], words and other elements [1, 15, 20]. The large variety of segmentation issues determines the multitude of algorithms and the publications which present them. We focused on methods based on wavelet transformation (e.g. [21, 24]).

The continuous nature of speech makes segmentation uncertain. Moreover, various acoustic segments may represent a single phonetic segment and vice versa. In [18] a phone segmentation based on frequency features detected in a speech signal was compared with a segmentation created by human transcribers.

The first stage of our speech analysis is extracting segments corresponding to phones. We used segmentation developed by Ziółko et al. [24]. This spectral method is based on the wavelet packet transformation which splits the speech signal into seven frequency bands. Each fraction is separated by digital low-pass and high-pass filters. Low frequencies have narrow bandwidths and are investigated with a finer resolution, while high frequencies have wide bandwidths, resulting in a lower resolution. The frequency ranges of the seven bands are: 0.5-1 kHz, 1-1.5 kHz, 1.5-2 kHz, 2-3 kHz, 3-4 kHz, 4-6 kHz and 6-8 kHz. In practice the boundaries between these bands overlap because digital filters do not have perfect frequency characteristics. Such speech analysis in the frequency domain corresponds to a perceptual scale.

The role of the segmentation algorithm is to detect significant transitions of energy among the frequency bands. Boundaries of phones are detected based on local changes in energy distribution. This method is universal enough to handle any language. We verified experimentally that having more than seven frequency bands increased the number of segments in comparison with manual segmentation.
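The idea of detecting transitions of energy among the frequency bands can be illustrated with a deliberately simplified sketch. It is not the algorithm of [24]: a naive discrete Fourier transform stands in for the wavelet packet decomposition, and the change score (the sum of absolute differences between normalised band energies of neighbouring frames) is a hypothetical substitute for the rank map and event function. Only the seven band ranges are taken from the text; the frame length and function names are assumptions.

```python
import cmath
import math

# Band edges in Hz, taken from the seven bands listed in the text.
BANDS_HZ = [(500, 1000), (1000, 1500), (1500, 2000), (2000, 3000),
            (3000, 4000), (4000, 6000), (6000, 8000)]


def band_energy_profile(frame, sample_rate):
    """Normalised energy per band for one frame, using a naive DFT."""
    n = len(frame)
    energies = [0.0] * len(BANDS_HZ)
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        for b, (lo, hi) in enumerate(BANDS_HZ):
            if lo <= freq < hi:
                coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                            for t, x in enumerate(frame))
                energies[b] += abs(coeff) ** 2
                break
    total = sum(energies) or 1.0  # avoid division by zero for silent frames
    return [e / total for e in energies]


def change_scores(signal, sample_rate, frame_len=128):
    """Score the change in energy distribution between adjacent frames."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    profiles = [band_energy_profile(f, sample_rate) for f in frames]
    return [sum(abs(p - q) for p, q in zip(a, b))
            for a, b in zip(profiles, profiles[1:])]
```

A large score between two frames suggests a candidate boundary; a real system would still need smoothing and thresholding, which are omitted here.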

Figure 1 is an example of speech segmentation based on energy distribution in seven frequency bands. The upper plot shows the wavelet time-frequency representation of the speech signal presented in the lowest part of Fig. 1. The Meyer wavelet of the 11th order was used. The other two plots show the rank map and the event function. The rank map shows the size of energy changes in the frequency bands. The event function presents the global importance of changes in energy distribution.

Fig. 1 An example of segmentation based on the time-frequency analysis of speech signal

4 Parametrization

Phones are treated as quasi-stationary segments. We assumed that the majority of phone identity information is concentrated in the centers of the segments. The parameters were calculated for speech segments scaled by the Hamming window to minimize co-articulation effects. Analysis was carried out by applying discrete wavelet packets similar to those used for segmentation, but with more frequency bands. Phone parameters were calculated as the average energy in eleven frequency bands (see Fig. 2). This way, every extracted phone was characterized by a time-stationary vector in the 11-dimensional frequency domain. Details are presented by Ziółko et al. [24]. Such frequency analysis is similar to the commonly used MFCC method. In both approaches, the analysis is carried out on frequency subbands with variable width. The most important differences are the lack of triangular windows and the smaller overlapping ranges in our method.

Fig. 2 Frequency bands of Wavelet Packet Decomposition for phones parametrization [24]

5 Clusterization

The clusterization algorithm involves creating a Gaussian Mixture Model (GMM) to approximate the probability density distribution of phones in an 11-dimensional frequency space. This approach is justified by the common use of GMM in speech modelling. We chose 1024 components (frequently used in other speech applications), which is significantly higher than the expected number of phone representatives in any language. Phone component groups were created by GMM hierarchical clustering. A similar approach to clustering was presented in [7]. Differences between components were calculated as Euclidean distances between expected values; Ward’s algorithm [23] was then used in a hierarchical clustering procedure.

GMM is associated with the probability density function

$$ p(x) = \sum\limits_{k = 1}^{K} \alpha_{k} \mathcal{N} \left( x | \overline{x}_{k}, {\Sigma}_{k}\right), $$
(1)

where αk is the mixture weight and K is the number of components equal to 1024 in our case. The multivariate Gaussian density distribution has the form

$$ \mathcal{N} \left( x | \overline{x}_{k}, {\Sigma}_{k}\right) = \frac{1}{\sqrt{(2\pi)^{11}|{\Sigma}_{k}|}} \exp \left( -0.5\ (x - \overline{x}_{k})^{T} {\Sigma}_{k}^{-1}(x - \overline{x}_{k})\right), $$
(2)

where the observation \(x\in \mathfrak {R}^{11}\) is a cosine transform of a vector representing the energy distribution for a phone and |Σk| is the corresponding determinant. Cosine transformation allows us to obtain diagonal covariance matrices Σk. Finally, the GMM model of phone distribution in the frequency domain is represented by weighting coefficients αk and the parameters of Gaussian functions: \(\overline {x}_{k}\) and \({\Sigma }_{k}^{-1}\).
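As a minimal sketch, assuming diagonal covariance matrices (as stated above) and the standard normalisation of the multivariate normal density, the mixture density (1) can be evaluated in plain Python as follows. The function names are illustrative and are not taken from the MATLAB code used in the study.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log of N(x | mean, diag(var)) for a diagonal covariance matrix."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)  # log-determinant of diag(var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)


def gmm_pdf(x, weights, means, variances):
    """Mixture density: weighted sum of the component densities."""
    return sum(w * math.exp(diag_gaussian_logpdf(x, m, v))
               for w, m, v in zip(weights, means, variances))
```

Working with the logarithm of each component keeps the determinant term numerically stable in eleven dimensions; for very small densities a log-sum-exp formulation would be preferable to the direct sum used here.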

Figure 3 presents the hierarchical clusterization of GMM components for English. The dendrogram shows the dependency of grouping GMM components in clusters and the cut-off point of Ward’s distance for 34 phone representatives.

Phone clusterization makes it possible to determine the statistical relationship between phones and phonemes for annotated speech samples. Such experiments showed that pure frequency analysis does not lead to credible mapping between acoustic units (phones) and linguistic transcriptions (phonemes). Left and right context information plays a vital role in accurate phone recognition.

Fig. 3 Results of hierarchical clustering of GMM model for English. The dotted line represents the cut-off Ward’s distance for 34 phones. For clarity, the bottom part of the dendrogram (with 1024 leaves) is not shown

6 Language differences based on the clustering procedure

The number of clusters c depends on an assumed admissible diversity ρ of elements within the clusters. It decreases if a greater diversity in each cluster is allowed. Figure 4 shows examples of the relationship between the number of clusters and Ward’s distance. These plots display the wide range of changes in the number of clusters. A distinctive visual property is the convergence of all the charts for a small and large number of clusters. The most significant differences appear if the number of clusters is in the range typical for the number of phonemes assigned to languages. It is generally assumed that the average number of phonemes for world languages is around 34.

Fig. 4 Number of clusters vs. the cut-off distance (compare with Fig. 3)

The experimental results characterized by Fig. 4 can be analyzed in many different ways. An important advantage is the ability to precisely approximate experimental data by the equation

$$ c(\rho)=a_{1} e^{-b_{1} \rho} + a_{2} e^{-b_{2} \rho} , $$
(3)

where a1, a2, b1, b2 are parameters chosen separately for each language. We fitted relationship (3) to experimental data for the range of cluster numbers from cmin = 1 to cmax = 512. If the number of clusters is equal to the number of Gauss functions (i.e. c = 1024), then each cluster contains one element only and the largest distance inside the clusters is equal to 0. This means that all curves shown in Fig. 4 must end at the point where distance = 0 and #Clusters = 1024. At this point there are no differences between languages, so it is of no interest. Model (3) proposed by us is a good representation of experimental data ranging between 1 and 512 clusters. This area is important for language differentiation. Including data for more than 512 clusters would reduce the visibility of differences between languages.

The adjusted R-square statistic was used to verify the quality of the mathematical model for each language. For mathematical model (3) with four parameters, the adjusted R-square statistic for the i-th language has the form

$$ \overline{R}^{2}_{i}= 1-\frac{(1-{R^{2}_{i}})(J-1)}{J-5} $$
(4)

where

$$ {R^{2}_{i}}= 1-\frac{{\sum}_{j = 1}^{J}(c_{i,j}-c_{i}(\rho_{i,j}))^{2}}{{\sum}_{j = 1}^{J}\left( c_{i,j}-\frac{1}{J}{\sum}_{j = 1}^{J}c_{i,j}\right)^{2}} $$
(5)

and J = 512 is the number of cluster changes, ci,j is the number of clusters when Ward’s distance is not greater than ρi,j, while ci(ρi,j) is the value of (3) for ρi,j.

The mathematical model (3) for English is characterized by the Root Mean Squared Error RMSE = 2.97 and \(\overline {R}^{2}= 0.9996\). For other languages, the fitting parameters are similar. The worst match was observed for Spanish, for which we obtained RMSE = 6.25 and \(\overline {R}^{2}= 0.998\).
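The goodness-of-fit measures (4) and (5) can be sketched as follows. The parameter order (a1, b1, a2, b2) and the function names are our assumptions. RMSE is computed here as the square root of the mean squared residual; fitting toolboxes sometimes divide by the residual degrees of freedom instead, so the values may differ slightly from those reported above.

```python
import math


def model(rho, a1, b1, a2, b2):
    """Double-exponential model (3) for the number of clusters."""
    return a1 * math.exp(-b1 * rho) + a2 * math.exp(-b2 * rho)


def fit_quality(rhos, counts, params, n_params=4):
    """Return (RMSE, adjusted R-square) of model (3) on the given data."""
    J = len(counts)
    preds = [model(r, *params) for r in rhos]
    ss_res = sum((c - p) ** 2 for c, p in zip(counts, preds))
    mean_c = sum(counts) / J
    ss_tot = sum((c - mean_c) ** 2 for c in counts)
    r2 = 1.0 - ss_res / ss_tot                      # equation (5)
    adj_r2 = 1.0 - (1.0 - r2) * (J - 1) / (J - n_params - 1)  # equation (4)
    rmse = math.sqrt(ss_res / J)
    return rmse, adj_r2
```

With four parameters the denominator J - n_params - 1 reduces to J - 5, in agreement with (4).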

There are languages which have a low frequency diversity, while in other languages differences between elements in clusters are significantly more noticeable. The relationships between the number of clusters c and the allowed distance ρ for two selected languages are presented in Fig. 5. The examples shown in this figure represent languages having extreme properties in the distribution of phones.

Fig. 5 Number of clusters as a function of the cut-off Ward’s distance for Arabic and Mandarin

The area

$$ A={\int}_{0}^{\infty} c(\rho) d\rho=\frac{a_{1}}{b_{1}} + \frac{a_{2}}{b_{2}} , $$
(6)

under the curve defined by (3) can be taken as the characteristic parameter for each analyzed language. The advantage of this scalar factor is its simple dependence on the experimental parameters a1, a2, b1, b2 characterizing the selected language. Small values of (6) indicate a rapid decrease of function (3). In this case, a relatively small change in Ward’s distance results in a significant change in the number of clusters. This means a small variety of articulated phones. Therefore parameter (6) characterizes the frequency diversity of phones and can be used for the ranking of languages.
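The closed form in (6) follows from integrating each exponential term separately, which requires b1 > 0 and b2 > 0. The sketch below compares it with a trapezoidal approximation; the parameter values in the usage note are made up for illustration and are not taken from Table 1.

```python
import math


def area_closed_form(a1, b1, a2, b2):
    """Area under model (3) on [0, infinity); requires b1 > 0 and b2 > 0."""
    return a1 / b1 + a2 / b2


def area_numeric(a1, b1, a2, b2, upper=200.0, steps=200_000):
    """Trapezoidal approximation of the same integral on [0, upper]."""
    def f(rho):
        return a1 * math.exp(-b1 * rho) + a2 * math.exp(-b2 * rho)

    h = upper / steps
    total = 0.5 * (f(0.0) + f(upper))
    for k in range(1, steps):
        total += f(k * h)
    return total * h
```

For the illustrative values a1 = 400, b1 = 0.6, a2 = 40, b2 = 0.1, both functions should agree to within the discretisation and truncation error of the numerical integral.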

The clustering procedure starts from 1024 components, because this number of Gauss functions was used to approximate the probability density. The number of clusters decreases as a result of the implementation of Ward’s algorithm. A pair of variables is successively obtained: the number of clusters and the maximal Ward’s distance between elements within the clusters. This is shown in Fig. 4 for six selected languages.
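The successive pairs of cluster number and Ward’s distance can be produced by a straightforward agglomerative loop, sketched here for the component means only. The scaling of the distance (chosen so that two single points at Euclidean distance d merge at height d) is an assumption that matches common library conventions; the function names are ours. The loop has cubic cost and is meant for illustration on small inputs, not as a replacement for an optimised library routine.

```python
import math
from itertools import combinations


def ward_distance(size_a, mean_a, size_b, mean_b):
    """Ward merge height between two clusters given sizes and centroids."""
    sq = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    return math.sqrt(2.0 * size_a * size_b / (size_a + size_b) * sq)


def ward_merge_curve(points):
    """Return [(clusters_remaining, merge_distance), ...] for each merge."""
    clusters = [(1, list(p)) for p in points]  # (size, centroid)
    curve = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest Ward distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: ward_distance(*clusters[ij[0]],
                                                *clusters[ij[1]]))
        (na, ma), (nb, mb) = clusters[i], clusters[j]
        dist = ward_distance(na, ma, nb, mb)
        merged = (na + nb,
                  [(na * a + nb * b) / (na + nb) for a, b in zip(ma, mb)])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        curve.append((len(clusters), dist))
    return curve
```

Applied to the 1024 component means of one language, the returned list would correspond to one curve of the kind plotted in Fig. 4.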

Assuming that the number of phones is equal to the number of phonemes assigned to the analyzed language, the diversity of phones can be assessed. On the basis of linguistic data it is possible to determine the average number of phonemes for the main languages. From data contained in [22], the expected value is slightly above 34 phonemes. For this number of clusters Fig. 5 shows significant variations between languages.

Assuming the number of clusters equal to 34 for all languages, it is possible to systematize them and group languages in terms of similarity. Let the set

$$ P=\{\rho_{i} : c_{i} (\rho_{i}) = 34\}_{i = 1}^{239} , $$
(7)

group the characteristic distances for the 239 languages being compared. Values of ρi depend on the frequency variety of phones. They can be determined directly from the clustering procedure, so they do not depend on the quality of mathematical model (3).
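Reading one element of the set (7) directly from the clustering output can be sketched as below. The input format, a list of (clusters remaining, merge distance) pairs in merge order, and the choice of the merge height at which the count first reaches 34 are our assumptions.

```python
def cutoff_distance(curve, target=34):
    """Return the merge distance at which `target` clusters remain.

    `curve` is a list of (clusters_remaining, merge_distance) pairs,
    one per merge, in the order the merges were performed.
    """
    for count, dist in curve:
        if count == target:
            return dist
    raise ValueError("target cluster count not present in the curve")
```

Because this value comes straight from the merge history, it is independent of the fitted model (3), as noted above.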

Figure 6 presents the flowchart of calculations carried out for each language separately. Most of the calculations (DWT parametrization, GMM training, clustering and curve fitting) are done using built-in MATLAB functions. The implementation of the speech segmentation algorithm was obtained from the authors of [24].

Fig. 6 Flowchart of calculations to determine the mathematical model (3) for a tested language

7 Experiments

Both indicators (6) and (7) can be used to assess the diversity of phones for the analyzed languages. The indicator (7) is calculated directly from the results of the clusterization. However, it is sensitive to local deviations and therefore the indicator (6) seems to be more accurate for the ranking of languages.

Table 1 presents two indicators which characterize the chosen 50 languages. The first indicator is defined by (6) and its values are presented in the second column. The second indicator is Ward’s distance assuming that the number of clusters is equal to 34. Values of this indicator are defined by (7) and are shown in the third column of Table 1. Both indicators are measures of phone diversity in the frequency domain, therefore they should be correlated. For the 239 analyzed languages the correlation coefficient is equal to 0.72. The languages are ordered from the highest to the lowest value of indicator (6). This means that languages which have relatively major differences in articulation are shown at the top of Table 1. This group includes Arabic and Punjabi. In contrast, Mandarin is characterized by the lowest variation in phone articulation.

The other four columns present the coefficients of mathematical model (3). This model is the sum of two exponential functions. The initial values of the first function are approximately ten times higher (a1 in relation to a2), but its decay rates are approximately six times higher (b1 in relation to b2). As a result, the second component of model (3), determined by parameters a2 and b2, has a greater impact on the model for ρ > 10.

The last two columns of Table 1 present RMSE and adjusted R-square statistics (4). The value of index (4) is equal to 1 if the mathematical model provides a perfect approximation. The data presented in the last column of Table 1 indicates very good usability of model (3). The next to last column presents RMSE values. For all languages these errors concern the number of clusters c which vary from 1 to 512. This index is more sensitive and it makes it possible to differentiate modeling efficiency when all results are very good.

Table 1 Ranking of languages according to phones diversity in the frequency domain

If we assume that the number of clusters is the same as the number of phonemes, then we can suppose that each cluster corresponds to a certain phoneme. To verify this hypothesis, experiments were carried out for languages taken from corpora with hand annotations. It appears that only 20% of phonemes were correctly allocated to clusters [12]. This observation is not surprising and was first noted around 60 years ago (e.g. [16]). Nowadays, hidden Markov models are used in automatic recognition systems to overcome these difficulties.

8 Conclusions

Our research was inspired by work carried out by Atkinson [2] in which he compared the phoneme diversity for 504 languages. Our main motivation was to find an acoustic similarity measure between languages that can lead to language taxonomies. We supposed that this measure could be used to verify Atkinson’s hypothesis about the presence of a founder effect in world languages. The comparison of the language ranking obtained by us with the results of Atkinson’s work does not confirm his hypothesis. Our experiments support the views of Atkinson’s adversaries, claiming that various factors conditioned by historical processes have a decisive influence on the diversity of articulation. These phenomena have a major impact on the evolution of languages. Their relatively high rate of change is clearly signalled in other studies, e.g. [9].

The data we obtained can be correlated with the geographical location of languages; additionally, there may be other phenomena which have a significant impact on the size of the differences in phone pronunciation. This direction of research could lead to interesting conclusions.

Our main conclusions arise from the analysis of data showing the relationship between the number of clusters and their internal differentiation. There are no clear isolated clusters in the frequency space. This makes it possible to fine-tune the continuous curve (3). However, Fig. 5 shows the existence of 24 visible isolated clusters for Chinese and fewer isolated clusters for Arabic.

The method of clustering we used is frequently applied in various types of scientific research. The availability of ready-made computer programs is a great simplification. How best to calculate the differences between components remains an open question; better methods than the Euclidean distance may exist.

Frequency analysis of phones is not sufficient to reliably determine phonemes in speech recognition systems. Although both types of frequency analyzers (the sense of hearing and electronic devices) operate efficiently, they cannot remove distortions from speech. The human brain and computer analysis (supported by trained models) play a highly important role in speech recognition.

We used the frequency variety of phones to rank the order of languages. We ranked languages from those where frequency diversities between phones are significant to languages where these differences are significantly smaller.

Major differences in the articulation of phones may involve languages spoken by non-native speakers i.e. people with diverse cultural backgrounds. Secondly, it seems that major differences in articulation make learning foreign languages easier.

Smaller differences between phone articulation may be due to fast speech. Schupert et al. [19] verified the hypothesis that differences in speech tempo are the main reason why spoken Danish is so difficult to understand for Norwegians and Swedes. Differences between Danish and Swedish, shown in Table 1, support this conclusion.

Data for all 239 tested languages is available from http://www.dsp.agh.edu.pl/_media/pl:research:language_ranking.pdf.