1 Introduction

Speech rehabilitation of patients after surgical treatment of cancer of the oral cavity and oropharynx is a pressing problem, and the high incidence of these diseases underlines its relevance. In Russia in 2014, the incidence of cancer of the oral cavity and pharynx was 5.6 per 100 thousand, and the prevalence was 36.5 per 100 thousand. Thus, about 13,000 new cancers of oropharyngeal localization are detected in the country each year, and the total number of patients with this disease exceeds 53,000 [1, 2]. One of the main problems in treatment is the need to restore the speech of patients after partial or complete surgical removal of some organs of the speech production tract, for example, the tongue. During rehabilitation, the quality of the patient’s speech has to be assessed. Until recently, this was done only by subjective evaluation. In previous studies we proposed a method based on GOST R 50840-95, "Speech transmission over various communication channels. Techniques for measurements of speech quality, intelligibility and voice identification" [3]. This technique yields a quantitative assessment of speech quality, for example, syllable intelligibility. However, an objective assessment within this method requires estimates from at least 5 auditors. In an actual speech rehabilitation process, this requirement is demanding, to say the least. The task, therefore, is to automate the estimation of the patient’s speech quality while minimizing the participation of a speech therapist. Such an assessment may be obtained by comparing the reference speech (the patient’s speech before surgery) with the speech during rehabilitation. This approach removes the need to account for differences between speakers: comparisons are made within a single speaker, which simplifies the problem.
This paper presents one of the steps toward solving this problem: the formation of quality criteria for the pronunciation of certain phonemes.

2 The Current State of Research

At previous stages of the research, we analyzed which groups of phonemes change most after surgical treatment of cancer of the oral cavity and oropharynx. Using software developed to automate the evaluation of speech quality on the basis of GOST R 50840-95 [4, 5], we obtained a list of the most affected phonemes, namely [r], [t], [s], [f], [k], and their softened variants. In many cases the softening feature itself was altered. The list was compared with a classical source on the phonemes most affected by this disease [6] (which refers to earlier work by I. Bolov, M. Solovyev, L. Dushak, D. Podgornykh and A. Shenderov, 1974); the high degree of agreement confirms the reliability of the obtained data. In addition, a preliminary study of the Fourier spectra of the most affected phonemes suggested that the upper part of the spectrum (frequencies of 2 kHz and above) is the most susceptible to change. An example of such spectra is presented in Fig. 1. Based on the analysis of these results, the task for the current stage of the project was formulated.

Fig. 1.

Fourier spectrograms for the Russian syllables [ʨtaɫ] (left half) and [səs’] (right half). In each half, the top panel shows the syllable recorded before surgery and the bottom panel shows the syllable recorded after surgery. The X axis shows time in seconds and the Y axis shows frequency in hertz, with an upper limit of 5000 Hz.

3 Setting Objectives for Research

At this stage we propose to form a simple measure of the difference between the normal pronunciation of sounds and the sounds produced during rehabilitation. The reference sounds are taken from syllables recorded by the patient at the preliminary examination before surgery. The first step is to form assumptions from the analysis of the spectra of healthy speakers, who pronounced the syllables first in the normal mode and then with minimal use of the tongue in speech production. One task is to determine which part of the signal should be used to form the measure of difference: the whole syllable or only the changed phoneme, and the whole spectrum or only its upper part. Another task is to choose a specific measure of the distance between the obtained spectra. As part of this task, we study the application of the Minkowskian distance (1) at different values of its parameter. Its special cases, the Euclidean distance (2) and the Manhattan distance (3) \((\text {p}=2, \text {p}=1)\), are considered as well [7]:

$$\begin{aligned} \rho (x,y) = \left( \sum _{i=1}^n |x_i-y_i|^p\right) ^{1/p}, \end{aligned}$$
(1)
$$\begin{aligned} d(x,y)=\sqrt{\sum _{i=1}^n (x_i-y_i)^2}, \end{aligned}$$
(2)
$$\begin{aligned} m(x,y) = \sum _{i=1}^n |x_i-y_i|, \end{aligned}$$
(3)

where x is the first signal, y is the second signal, i is the position in the signal, n is the signal length, \(\rho \) is the Minkowskian distance, d is the Euclidean distance, m is the Manhattan distance, and p is the parameter of the Minkowskian distance.
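Formulas (1)-(3) can be illustrated with a minimal sketch in plain Python. The function name `minkowski` and the sample vectors are hypothetical and chosen only for illustration; the sketch also rejects p = 0, where the exponent 1/p is undefined, although the text does not say how that value is treated.

```python
def minkowski(x, y, p):
    """Minkowskian distance (1) between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("signals must have the same length")
    if p == 0:
        # Assumption of this sketch: p = 0 is skipped (1/p is undefined).
        raise ValueError("p = 0 is undefined because of the 1/p exponent")
    # Note: for p < 0 any pair of equal samples raises ZeroDivisionError.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Illustrative data only (not taken from the paper).
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

euclidean = minkowski(x, y, 2)   # special case (2)
manhattan = minkowski(x, y, 1)   # special case (3)
```

With these vectors the element-wise differences are 1, 2 and 3, so the Manhattan distance is their sum and the Euclidean distance is the square root of the sum of their squares.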

The Minkowskian distance is applied for different values of its parameter. The values of p are taken from \(-10\) to 10 with a step of 0.1: \(\text {p} = [- 10\, -9.9\, -9.8~...~9.9\, 10]\). The practical usefulness of all these values is examined, even though for \(\text {p}<1\) the distance is not a metric. After the distances between different realizations of phonemes in normal and modified pronunciation were obtained, several preliminary approaches to forming a criterion of phoneme pronunciation quality from these distances were considered. The criteria and their characteristics are detailed in the corresponding section. The results are assessed for the various values of the Minkowskian distance parameter on the basis of the proposed criteria, and a preliminary choice of the parameter for practical use is made. After a measure of the distance between realizations of syllables or phonemes has been chosen and the pronunciation quality criteria have been analyzed, the assumptions are checked on recordings of real patients, and a decision is made on the applicability of the proposed approach.

4 The Basic Signal Processing Steps in Determining the Quality of the Pronunciation of Syllables or Phonemes

Speech signal processing can be represented as a sequence of performing the following steps:

  1. normalization of all studied speech signals in duration;

  2. normalization of all studied speech signals in power;

  3. computation of the Fourier spectrum [8] of all signals, using an analysis window of 256 samples and a shift of 1 sample between windows;

  4. determination of pairwise distances between all signals on the basis of the Minkowskian distance;

  5. calculation of the pronunciation quality assessment for a syllable or phoneme from the analysis of pairwise distances between different realizations of the syllable (phoneme).

This procedure yields a criterion, or a set of criteria, for estimating the pronunciation quality of a syllable (phoneme).
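The spectrum computation in step 3 can be sketched as a sliding-window Fourier transform. The sketch below is only an illustration under stated assumptions: the function name `magnitude_spectrogram` is hypothetical, a Hann weighting is assumed because the text does not name a window function, and a naive discrete Fourier transform in plain Python is used for self-containment (a real implementation would normally use an FFT library).

```python
import cmath
import math


def magnitude_spectrogram(signal, window=256, hop=1):
    """Return a list of frames, each holding window // 2 + 1 magnitudes."""
    if window < 2 or len(signal) < window:
        raise ValueError("signal must contain at least one full window")
    # Assumption: Hann weighting (not specified in the text).
    weights = [0.5 - 0.5 * math.cos(2 * math.pi * k / (window - 1))
               for k in range(window)]
    bins = window // 2 + 1  # non-negative frequencies only
    # Precompute the DFT exponentials once for all frames.
    twiddle = [[cmath.exp(-2j * math.pi * f * k / window)
                for k in range(window)] for f in range(bins)]
    frames = []
    for start in range(0, len(signal) - window + 1, hop):
        seg = [signal[start + k] * weights[k] for k in range(window)]
        frames.append([abs(sum(s * w for s, w in zip(seg, row)))
                       for row in twiddle])
    return frames


# Small illustrative call: a sine centred on bin 2 of an 8-sample window.
demo = [math.sin(2 * math.pi * 2 * n / 8) for n in range(16)]
demo_frames = magnitude_spectrogram(demo, window=8, hop=1)
```

The defaults mirror the values given in the list (256-sample window, shift of 1 sample); the demonstration uses a much smaller window only to keep the naive transform fast.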

5 Preliminary Normalization of a Speech Signal

5.1 Normalization of All Studied Speech Signals on Duration

Each phoneme is brought to a duration of 0.050 s by interpolation. This value is chosen so that no information is lost: it is certainly longer than the duration of a single phoneme.
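A minimal sketch of this duration normalization is given below. The text specifies only "interpolation", so linear interpolation is an assumption here, as are the function names `resample_linear` and `normalize_duration` and the sampling rate used in the example.

```python
def resample_linear(samples, target_len):
    """Linearly interpolate samples onto target_len evenly spaced points."""
    if len(samples) < 2 or target_len < 2:
        raise ValueError("need at least two input and two output samples")
    scale = (len(samples) - 1) / (target_len - 1)
    out = []
    for k in range(target_len):
        pos = k * scale                          # fractional input position
        left = min(int(pos), len(samples) - 2)   # index of the left neighbour
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[left + 1] * frac)
    return out


def normalize_duration(samples, sample_rate, duration_s=0.050):
    """Bring a phoneme to a fixed duration (0.050 s by default)."""
    target_len = round(sample_rate * duration_s)
    return resample_linear(samples, target_len)


# Illustrative call; the 8000 Hz sampling rate is an assumption.
phoneme = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
fixed = normalize_duration(phoneme, sample_rate=8000)
```

After this step every phoneme has the same number of samples, so the spectra compared later have the same number of time frames.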

5.2 Normalization of All Studied Speech Signals on Mean Power of Signal

Signals are brought to the same power if they contain the same number of phonemes. The normalization coefficient can be defined as the square root of the mean power of the signal, which is given by [9]:

$$\begin{aligned} MP=\sum _{i=1}^n \frac{A_i^2}{n}, \end{aligned}$$
(4)

where \(A_i\) is the signal amplitude at sample i, MP is the mean power, and n is the signal length.

This normalization is not obviously justified, and is debatable, for syllables in which several phonemes differ between the compared syllables, because the phonemes contribute differently to the overall energy. For a single phoneme this drawback is absent.
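Formula (4) and the normalization coefficient can be sketched as follows. The function names are hypothetical, and dividing by the coefficient so that every signal ends up with unit mean power is an assumption about how the coefficient is applied.

```python
import math


def mean_power(signal):
    """Mean power MP of a signal, formula (4)."""
    return sum(a * a for a in signal) / len(signal)


def normalize_power(signal):
    """Scale a signal by 1 / sqrt(MP) so that its mean power becomes 1."""
    mp = mean_power(signal)
    if mp == 0:
        raise ValueError("a silent signal cannot be normalized")
    coefficient = math.sqrt(mp)  # normalization coefficient
    return [a / coefficient for a in signal]


# Illustrative data only.
normalized = normalize_power([1.0, -2.0, 3.0, -4.0])
```

Because all signals are scaled to the same mean power, the distances computed later reflect differences in spectral shape rather than in recording loudness.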

6 Determination of Paired Distances on the Basis of Minkowskian Distance

Taking into account that spectra are compared and that the measure is normalized by the size of the spectrum, the measure takes the following form:

$$\begin{aligned} l(x,y) = \left( \frac{\sum _{j=1}^{n_f} \sum _{i=1}^{n_t} |x_{ij}-y_{ij}|^p}{n_f \cdot n_t}\right) ^{1/p}, \end{aligned}$$
(5)

where x is the first spectrum, y is the second spectrum, i is the index of the time sample, j is the index of the frequency sample, \(n_t\) is the number of time samples, and \(n_f\) is the number of frequency samples.
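Measure (5) can be sketched in plain Python, assuming each spectrum is stored as a list of time frames, each frame being a list of magnitudes. The name `spectral_distance` and the sample data are hypothetical, and p = 0 is rejected because the exponent 1/p is undefined there.

```python
def spectral_distance(x, y, p):
    """Normalized Minkowskian distance (5) between two spectra."""
    if p == 0:
        raise ValueError("p = 0 is undefined because of the 1/p exponent")
    n_t = len(x)      # number of time samples
    n_f = len(x[0])   # number of frequency samples
    if len(y) != n_t or any(len(fx) != n_f or len(fy) != n_f
                            for fx, fy in zip(x, y)):
        raise ValueError("spectra must have the same shape")
    # Note: for p < 0 any pair of equal cells raises ZeroDivisionError.
    total = sum(abs(a - b) ** p
                for fx, fy in zip(x, y)
                for a, b in zip(fx, fy))
    return (total / (n_f * n_t)) ** (1.0 / p)


# Illustrative 2 x 2 spectra (not taken from the paper).
x = [[1.0, 2.0], [3.0, 4.0]]
y = [[2.0, 2.0], [3.0, 6.0]]
d1 = spectral_distance(x, y, 1)
```

Dividing by the number of cells turns the sum into a power mean, so two spectra that differ by a constant c in every cell are at distance c for any admissible p, regardless of the spectrum size.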

The results of the calculation are entered in the matrix of pairwise distances shown in Fig. 2; since the distances are symmetric, only one triangle of the matrix is needed. Area \(S_1\) corresponds to comparisons between initial and modified signals, area \(S_2\) to comparisons among initial signals only, and area \(S_3\) to comparisons among modified signals only. Here \(n_1\) is the number of initial signals, \(n_2\) is the number of modified signals, and \(l_{ij}\) is the distance between the spectra of signals i and j.

Fig. 2.

Structure of the matrix of distances between spectra.

The average distance over each area is determined from this matrix using the formulas:

$$\begin{aligned} \overline{l_1}=\frac{\sum _{i,j \in S_1} l_{ij}}{n_1 \cdot n_2}, \end{aligned}$$
(6)
$$\begin{aligned} \overline{l_2}=\frac{2 \sum _{i,j \in S_2} l_{ij}}{n_1 \cdot (n_1-1)}, \end{aligned}$$
(7)
$$\begin{aligned} \overline{l_3}=\frac{2( \sum _{i,j \in S_2,S_3} l_{ij})}{n_1 \cdot (n_1-1)+n_2 \cdot (n_2-1)} \end{aligned}$$
(8)

as well as the minimum distance over area \(S_1\) and the maximum distance over area \(S_2\):

$$\begin{aligned} l_{1min}=\min _{i,j \in S_1} l_{ij}, \end{aligned}$$
(9)
$$\begin{aligned} l_{2max}=\max _{i,j \in S_2} l_{ij}. \end{aligned}$$
(10)
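Formulas (6)-(10) can be sketched as follows. The helper names, the dictionary keys and the toy data are hypothetical; `dist` stands for any pairwise distance function, such as measure (5). Each unordered pair is visited once, which matches the factors of 2 in formulas (7) and (8).

```python
from itertools import combinations


def area_distances(initial, modified, dist):
    """Collect the pairwise distances of the areas S1, S2 and S3."""
    s1 = [dist(a, b) for a in initial for b in modified]      # initial vs modified
    s2 = [dist(a, b) for a, b in combinations(initial, 2)]    # initial vs initial
    s3 = [dist(a, b) for a, b in combinations(modified, 2)]   # modified vs modified
    return s1, s2, s3


def area_statistics(s1, s2, s3):
    """Averages (6)-(8) and extremes (9)-(10)."""
    return {
        "l1_mean": sum(s1) / len(s1),                             # (6)
        "l2_mean": sum(s2) / len(s2),                             # (7)
        "l3_mean": (sum(s2) + sum(s3)) / (len(s2) + len(s3)),     # (8)
        "l1_min": min(s1),                                        # (9)
        "l2_max": max(s2),                                        # (10)
    }


# Toy example: scalar "signals" compared by absolute difference.
initial = [0.0, 1.0, 2.0]
modified = [10.0, 12.0]
s1, s2, s3 = area_distances(initial, modified, lambda a, b: abs(a - b))
stats = area_statistics(s1, s2, s3)
```

Here `len(s2)` equals \(n_1(n_1-1)/2\) and `len(s3)` equals \(n_2(n_2-1)/2\), so dividing by the list lengths reproduces the denominators of the formulas.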

These values are used to form criteria of phoneme pronunciation quality. The following criteria are proposed:

  1. The ratio of the average distance between initial and modified signals to the average distance between initial signals:

    $$\begin{aligned} Cr_1=\overline{l_1}/ \overline{l_2}. \end{aligned}$$
    (11)

    The closer this value is to 1, the closer the modified signal is to the initial one, and vice versa. A shortcoming is that the final value can be almost completely determined by outliers with large values.

  2. The ratio of the average distance between initial and modified signals to the average distance between signals of one type:

    $$\begin{aligned} Cr_2=\overline{l_1}/ \overline{l_3}. \end{aligned}$$
    (12)

    The closer this value is to 1, the closer the modified signal is to the initial one, and vice versa. As with the first criterion, the final value can be almost completely determined by outliers with large values. Using the differences between modified signals is questionable because of their lower stability and, as a result, large distance values. However, the applicability of this criterion requires additional practical verification.

  3. The ratio of the minimum distance between initial and modified signals to the maximum distance between initial signals:

    $$\begin{aligned} Cr_3=l_{1min}/l_{2max}. \end{aligned}$$
    (13)

    If this value is greater than 1, the sets of distances for the two classes do not overlap, and the farther apart they are, the better the criterion. In reality, however, this estimate is determined by extreme values that fall outside the general set, and in practice it will almost always be less than 1. In that case, on the one hand, the closer the value is to 1, the smaller the area where the sets of distances for initial and modified signals overlap. On the other hand, the criterion does not take into account the number or share of signals that fall into this area.

  4. The share of pairs of initial signals whose distance exceeds the minimum distance between pairs of initial and modified signals, and the share of pairs of initial and modified signals whose distance is less than the maximum distance between pairs of initial signals:

    $$\begin{aligned} Cr_{41}=\frac{2 \cdot count_{i,j \in S_2} (l_{ij}>l_{1min})}{n_1(n_1-1)}, \end{aligned}$$
    (14)
    $$\begin{aligned} Cr_{42}=\frac{count_{i,j \in S_1} (l_{ij}<l_{2max})}{n_1n_2}. \end{aligned}$$
    (15)

    Ideally both shares equal 0. In practice, the smaller the value, the better, because the overlap caused by the pairs of signals that fall into this area is smoothed out by the averaging in criterion 1 (provided that such pairs are few).
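The criteria (11)-(15) can be computed from the lists of distances belonging to the areas \(S_1\), \(S_2\) and \(S_3\), as in the minimal sketch below. The function name, the dictionary keys and the example numbers are hypothetical and serve only as an illustration; the sketch assumes that none of the denominators is zero.

```python
def quality_criteria(s1, s2, s3):
    """Criteria (11)-(15) from the distance lists of areas S1, S2, S3."""
    l1_mean = sum(s1) / len(s1)                               # (6)
    l2_mean = sum(s2) / len(s2)                               # (7)
    l3_mean = (sum(s2) + sum(s3)) / (len(s2) + len(s3))       # (8)
    l1_min = min(s1)                                          # (9)
    l2_max = max(s2)                                          # (10)
    return {
        "Cr1": l1_mean / l2_mean,                             # (11)
        "Cr2": l1_mean / l3_mean,                             # (12)
        "Cr3": l1_min / l2_max,                               # (13)
        # len(s2) == n1 * (n1 - 1) / 2, hence the factor 2 in (14).
        "Cr41": sum(1 for d in s2 if d > l1_min) / len(s2),   # (14)
        # len(s1) == n1 * n2, as in (15).
        "Cr42": sum(1 for d in s1 if d < l2_max) / len(s1),   # (15)
    }


# Well-separated toy case: three initial and two modified signals.
separated = quality_criteria(
    s1=[10.0, 12.0, 9.0, 11.0, 8.0, 10.0],
    s2=[1.0, 2.0, 1.0],
    s3=[2.0],
)

# Overlapping toy case (distance lists given directly, for illustration).
overlapping = quality_criteria(s1=[1.0, 3.0], s2=[2.0, 0.5, 1.5], s3=[1.0])
```

In the separated case \(Cr_3\) exceeds 1 and both shares are 0; in the overlapping case \(Cr_3\) falls below 1 and the shares show how many pairs lie in the overlap.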

7 Analysis of the Signals of Healthy Speakers Using the Proposed Approach

At this stage the study was conducted on recordings of one male and one female speaker. Each speaker made 10 recordings; the first 5 and the last 5 syllables differed, but contained the same phoneme in the same position within the syllable.

Comparisons were then made separately for each speaker. Comparing different speakers with each other is clearly less important, because it is difficult to account simultaneously for several factors that change pronunciation (the specific speaker and the condition of the speech production tract).

Fig. 3 (left half) shows the values of the obtained quality criteria over the whole studied range of p with a step of 1. The right half shows the criteria for the most characteristic part of the range, selected from the results of the previous stage of the experiment.

Fig. 3.

Values of criteria for \(p=-10\ldots 10\) (left) and \(-0.5\ldots 0.5\) (right)

From these results, the Minkowskian distance is most appropriate when the parameter p is between 1.6 and 3.1. Fig. 4 shows a similar dependence obtained for the upper half of the spectrum (3–6 kHz).

Fig. 4.

Values of criteria for \(p=-10\ldots 10\) (left) and \(-0.5\ldots 0.5\) (right)

The results show that, at this stage, the considered range of values is sufficient for a preliminary choice of the Minkowskian distance parameter and does not need to be extended. The usefulness of restricting the analysis to the upper frequency range, which preliminary experiments suggested to be potentially more informative, was not confirmed and requires further research.

A preliminary experiment on recordings of real patients was conducted to confirm the results. It supports the findings, but there is not enough space in this article for a detailed description, which will be given in subsequent publications. In addition, the next step will be to use mel-cepstral coefficients [10], linear prediction coefficients [11] and autocorrelation [12] for the evaluation of speech quality. Automating the segmentation of syllables into phonemes is also a problem for the next stage of research [13].

8 Conclusion

This work presents the results of one phase in forming criteria for evaluating the quality of phoneme pronunciation by patients undergoing speech rehabilitation after surgery for cancer of the oral cavity and oropharynx. Criteria based on the Minkowskian distance between normalized spectra of defective phonemes were formed. A preliminary value of the distance parameter was selected for the most informative criterion. The proposed method of assessing the pronunciation quality of the phonemes [t] and [k] was tested on recordings of real patients. The tasks for the next phase of the study were set. This work is one part of the larger task of assessing the quality of speech during speech rehabilitation.