Evaluation of the Speech Quality During Rehabilitation After Surgical Treatment of the Cancer of Oral Cavity and Oropharynx Based on a Comparison of the Fourier Spectra

Kostyuchenko, Evgeny; Roman, Mescheryakov; Ignatieva, Dariya; Pyatkov, Alexander; Choynzonov, Evgeny; Balatskaya, Lidiya

doi:10.1007/978-3-319-43958-7_34

Evgeny Kostyuchenko, Mescheryakov Roman, Dariya Ignatieva, Alexander Pyatkov, Evgeny Choynzonov & Lidiya Balatskaya

Conference paper
First Online: 13 August 2016

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9811)

Abstract

In this paper, we propose a selection of parameters for a criterion that evaluates the pronunciation quality of certain phonemes. We compare different options and criteria for selecting the parameter of the underlying measure, the Minkowskian distance. This approach is used for the comparative assessment of utterance quality during the voice rehabilitation of patients after surgical treatment of cancer of the oral cavity and oropharynx. The pronunciation recorded before surgery, taken as the reference, is compared with the pronunciation after the operation, recorded in the course of sessions with a speech therapist. The proposed criterion is calculated by comparing the Fourier spectra of these signals and measuring their differences with the Minkowskian distance. The signals are first normalized so that their spectra are comparable. As a result of the experiment, we suggest a value of the Minkowskian distance parameter that best separates the signals when comparing pronunciation quality. Various approaches to forming the criteria for evaluating phoneme pronunciation quality are presented. The applicability of the proposed approach to an objective comparative evaluation of the pronunciation quality of the phonemes [k] and [t] in patients before and after surgery is confirmed.

1 Introduction

Speech rehabilitation of patients after surgical treatment of cancer of the oral cavity and oropharynx is a relevant problem, as the high incidence of these diseases shows. In 2014 in Russia the incidence of cancer of the oral cavity and pharynx was 5.6 per 100 thousand, and the prevalence was 36.5 per 100 thousand. Thus, each year about 13,000 new cancers of oropharyngeal localization are detected in the country, and the total number of patients suffering from this disease exceeds 53,000 people [1, 2]. One of the main problems in treatment is the need to restore the speech of patients after partial or complete surgical removal of some organs of the speech production tract, for example the tongue. During rehabilitation it is necessary to estimate the quality of the patient's speech. Until recently, this problem was solved only by subjective evaluation of speech quality. In previous studies we proposed a method based on GOST R 50840-95, "Speech transmission over various communication channels. Techniques for measurements of speech quality, intelligibility and voice identification" [3]. This technique yields a quantitative assessment of speech quality, for example syllable intelligibility. However, obtaining an objective assessment within this method requires estimates from at least 5 listeners. In the actual process of voice rehabilitation, this requirement is demanding, to say the least. The task therefore arises of automating the estimation of the patient's speech quality while minimizing the participation of a speech therapist. Such an assessment can be obtained by comparing the reference speech (the patient's speech before surgery) with the speech during rehabilitation. This approach removes the need to account for the characteristics of different speakers: comparisons are made only within one speaker, which simplifies the problem.
This paper presents one of the steps towards solving this problem: forming criteria for the pronunciation quality of certain phonemes.

2 The Current State of Research

At previous stages of the research, we analyzed which groups of phonemes change most after surgical treatment of cancer of the oral cavity and oropharynx. Using software developed to automate the evaluation of speech quality on the basis of GOST R 50840-95 [4, 5], we obtained the list of the most affected phonemes, namely [r], [t], [s], [f], [k], as well as their softened variants. In many cases the softening feature itself was modified. The obtained list was compared with classical sources describing the phonemes most affected by this disease [6] (which refers to earlier work by I. Bolov, M. Solovyev, L. Dushak, D. Podgornykh and A. Shenderov, 1974); the high degree of agreement confirms the reliability of the obtained data. In addition, a preliminary study of the Fourier spectra of the phonemes most prone to change led to the assumption that the upper part of the spectrum (frequencies of 2 kHz and above) is the most susceptible to change. An example of these spectra is presented in Fig. 1. Based on the analysis of these results, the task for the current stage of the project was formulated.

3 Setting Objectives for Research

At this stage we propose to form a simple measure of the differences between the normal pronunciation of sounds (the reference sounds are taken from syllables recorded during the patient's preliminary examination before surgery) and the sounds produced during rehabilitation. The first step is to make assumptions based on the analysis of the spectra of healthy speakers, who pronounced the syllables first in the normal mode and then with minimal use of the tongue in speech production. The task was set of determining which part of the signal to use for the difference measure: whether it is better to use the whole syllable or only the changed phoneme, and the whole spectrum or only its upper part. The choice of a specific difference measure for determining the distance between the obtained spectra is also addressed. As part of this task, we study the application of the Minkowskian distance (1) at different values of its parameter. Its special cases, the Euclidean distance (2) and the Manhattan distance (3) ($p=2$ and $p=1$, respectively), are also considered [7]:

$$\begin{aligned} \rho (x,y) = \left( \sum _{i=1}^n |x_i-y_i|^p\right) ^{1/p}, \end{aligned}$$

(1)

$$\begin{aligned} d(x,y)=\sqrt{\sum _{i=1}^n (x_i-y_i)^2}, \end{aligned}$$

(2)

$$\begin{aligned} m(x,y) = \sum _{i=1}^n |x_i-y_i|, \end{aligned}$$

(3)

where x is the first signal, y is the second signal, i is the position in the signal, n is the number of positions, $\rho $ is the Minkowskian distance, d is the Euclidean distance, m is the Manhattan distance, and p is the parameter of the Minkowskian distance.
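The three distances (1)-(3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, signals are plain lists of numbers of equal length, and the case $p=0$ is rejected because the exponent $1/p$ is undefined there.

```python
import math


def minkowski(x, y, p):
    """Minkowskian distance (1) between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("signals must have the same length")
    if p == 0:
        # Assumption of this sketch: p = 0 is rejected (1/p is undefined).
        raise ValueError("p = 0 is not supported")
    # Note: for negative p a zero difference raises ZeroDivisionError.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean(x, y):
    """Euclidean distance (2), the special case p = 2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Manhattan distance (3), the special case p = 1."""
    return sum(abs(a - b) for a, b in zip(x, y))
```

For example, `minkowski([0, 0], [3, 4], 2)` agrees with `euclidean([0, 0], [3, 4])`, and `minkowski([0, 0], [3, 4], 1)` agrees with `manhattan([0, 0], [3, 4])`.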

The problem is solved with the Minkowskian distance at different values of its parameter. The following values of the parameter p are selected: $p = [-10, -9.9, -9.8, \ldots, 9.9, 10]$. These values are considered for practical application even though the distance in question is not a metric for $p<1$. After obtaining the distances between different realizations of phonemes in normal and modified pronunciation, we consider several preliminary approaches to forming a criterion of phoneme pronunciation quality on the basis of the distance. The criteria and their characteristics are detailed in the corresponding section. The results obtained at various values of the Minkowskian distance parameter are assessed with the proposed criteria, and a preliminary choice of the parameter for practical use is made. After choosing the distance measure between realizations of syllables or phonemes and analyzing the criteria for evaluating pronunciation quality, the resulting assumptions are checked on recordings of real patients, and a decision is made on the applicability of the proposed approach.
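The grid of parameter values can be generated as in the following sketch. The helper name `p_grid` is hypothetical. The text does not say how $p=0$ was handled; since the exponent $1/p$ is undefined there, this sketch simply skips that value, which is an assumption and not a statement about the original experiment.

```python
def p_grid(lo_tenths=-100, hi_tenths=100):
    """Parameter values from -10 to 10 in steps of 0.1.

    Integer tenths are used to avoid accumulating floating-point error.
    The value p = 0 is skipped (assumption of this sketch).
    """
    return [k / 10 for k in range(lo_tenths, hi_tenths + 1) if k != 0]


grid = p_grid()
```

Each value in `grid` can then be passed as the parameter p of the distance measure.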

4 The Basic Signal Processing Steps in Determining the Quality of the Pronunciation of Syllables or Phonemes

Speech signal processing can be represented as the following sequence of steps:

1. normalization of all studied speech signals by duration;
2. normalization of all studied speech signals by signal power;
3. computation of the Fourier spectrum [8] of all signals, using an analysis window of 256 samples and an offset of 1 sample between windows;
4. determination of pairwise distances between all signals on the basis of the Minkowskian distance;
5. calculation of an assessment of the pronunciation quality of a syllable or phoneme from the analysis of pairwise distances between various realizations of the syllable (phoneme).

This procedure yields a criterion, or a set of criteria, for estimating the pronunciation quality of a syllable (phoneme).
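Step 3 of this procedure can be illustrated by the sketch below, which computes a short-time Fourier magnitude spectrum with a sliding window. Several details are assumptions, since the text gives only the window length (256 samples) and the offset (1 sample): a rectangular window is used, magnitudes rather than powers are kept, and only the non-negative frequencies are returned. The function names are hypothetical, and the naive DFT is chosen for self-containment, not speed.

```python
import cmath


def magnitude_spectrum(frame):
    """Magnitudes of the DFT of one frame for bins 0 .. len(frame) // 2."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal, window=256, hop=1):
    """List of magnitude spectra, one per window position.

    Rows index time (n_t window positions), columns index frequency (n_f bins).
    """
    if len(signal) < window:
        raise ValueError("signal is shorter than the analysis window")
    return [
        magnitude_spectrum(signal[start:start + window])
        for start in range(0, len(signal) - window + 1, hop)
    ]
```

With the default settings a signal of N samples gives N - 255 window positions, so in practice an FFT routine would replace the naive transform used here.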

5 Preliminary Normalization of a Speech Signal

5.1 Normalization of All Studied Speech Signals by Duration

Each phoneme is brought to a duration of 0.050 s by interpolation. This value is chosen so that no information is lost, since it certainly exceeds the duration of a single phoneme.
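A minimal sketch of this duration normalization is given below. The text does not specify the kind of interpolation or the sampling rate, so linear interpolation and the `sample_rate` argument are assumptions of this example, and the function names are hypothetical.

```python
def resample_linear(signal, target_len):
    """Resample a sequence to target_len points by linear interpolation."""
    if len(signal) < 2 or target_len < 2:
        raise ValueError("need at least two input and two output samples")
    scale = (len(signal) - 1) / (target_len - 1)
    out = []
    for k in range(target_len):
        pos = k * scale
        # Clamp so that i + 1 is always a valid index.
        i = min(int(pos), len(signal) - 2)
        frac = pos - i
        out.append(signal[i] * (1 - frac) + signal[i + 1] * frac)
    return out


def normalize_duration(signal, sample_rate, duration_s=0.050):
    """Bring a phoneme to a fixed duration (0.050 s in the text)."""
    return resample_linear(signal, round(sample_rate * duration_s))
```

For instance, at an assumed sampling rate of 8000 Hz every phoneme would be brought to 400 samples.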

5.2 Normalization of All Studied Speech Signals by Mean Signal Power

Signals are brought to identical power if they contain an identical number of phonemes. The normalization coefficient can be defined as the square root of the mean power of the signal. The mean power of a signal is determined by the formula [9]:

$$\begin{aligned} MP=\sum _{i=1}^n \frac{A_i^2}{n}, \end{aligned}$$

(4)

where $A_i$ is the amplitude of the signal at discrete sample i, MP is the mean power, and n is the length of the signal.
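Formula (4) and the corresponding normalization can be sketched as follows. Dividing every sample by the square root of the mean power brings the signal to unit mean power; choosing 1 as the common target power is an assumption of this sketch, and the function names are hypothetical.

```python
import math


def mean_power(signal):
    """Mean power MP of a signal, formula (4)."""
    return sum(a * a for a in signal) / len(signal)


def normalize_power(signal):
    """Scale a signal so that its mean power equals 1."""
    mp = mean_power(signal)
    if mp == 0:
        raise ValueError("a silent signal cannot be normalized")
    coefficient = math.sqrt(mp)  # normalization coefficient
    return [a / coefficient for a in signal]
```

After this step any two signals have the same mean power, so the differences between their spectra are not dominated by loudness.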

This normalization is non-obvious and disputable when applied to syllables in which several phonemes differ between the compared syllables, because the phonemes contribute differently to the overall energy. When it is applied to a single phoneme, this shortcoming is absent.

6 Determination of Paired Distances on the Basis of Minkowskian Distance

Taking into account that the compared objects are spectra and normalizing by the size of the spectrum, the measure takes the following form:

$$\begin{aligned} l(x,y) = \left( \frac{\sum _{j=1}^{n_f} \sum _{i=1}^{n_t} |x_{ij}-y_{ij}|^p}{n_f \cdot n_t}\right) ^{1/p}, \end{aligned}$$

(5)

where x is the first spectrum, y is the second spectrum, i is the index of the time sample, j is the index of the frequency sample, $n_t$ is the number of time samples, and $n_f$ is the number of frequency samples.
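Measure (5) can be sketched for two spectra stored as lists of rows (time) of equal-length lists (frequency). The storage layout and the function name are assumptions of this example.

```python
def spectral_distance(x, y, p):
    """Size-normalized Minkowskian distance (5) between two spectra.

    x and y are lists of n_t rows, each row holding n_f frequency values.
    """
    n_t = len(x)
    n_f = len(x[0])
    if len(y) != n_t or any(len(r) != n_f for r in x) or any(len(r) != n_f for r in y):
        raise ValueError("spectra must have the same shape")
    if p == 0:
        raise ValueError("p = 0 is not supported")  # 1/p is undefined
    total = sum(
        abs(a - b) ** p
        for row_x, row_y in zip(x, y)
        for a, b in zip(row_x, row_y)
    )
    return (total / (n_f * n_t)) ** (1.0 / p)
```

Because of the division by $n_f \cdot n_t$, two spectra that differ by a constant c in every cell are at distance c for any positive p.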

The results of the calculation are entered into the pairwise distance matrix presented in Fig. 2. Here area $S_1$ corresponds to comparisons between initial and modified signals, area $S_2$ to comparisons among initial signals only, and area $S_3$ to comparisons among modified signals only; $n_1$ is the number of initial signals, $n_2$ is the number of modified signals, and $l_{ij}$ is the distance between the spectra of signals i and j.

The average distance over each area is determined from this matrix using the formulas:

$$\begin{aligned} \overline{l_1}=\frac{\sum _{i,j \in S_1} l_{ij}}{n_1 \cdot n_2}, \end{aligned}$$

(6)

$$\begin{aligned} \overline{l_2}=\frac{2 \sum _{i,j \in S_2} l_{ij}}{n_1 \cdot (n_1-1)}, \end{aligned}$$

(7)

$$\begin{aligned} \overline{l_3}=\frac{2( \sum _{i,j \in S_2,S_3} l_{ij})}{n_1 \cdot (n_1-1)+n_2 \cdot (n_2-1)} \end{aligned}$$

(8)

as are the minimum distance over area $S_1$ and the maximum distance over area $S_2$:

$$\begin{aligned} l_{1min}=\min _{i,j \in S_1} l_{ij}, \end{aligned}$$

(9)

$$\begin{aligned} l_{2max}=\max _{i,j \in S_2} l_{ij}. \end{aligned}$$

(10)
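Quantities (6)-(10) can be computed from a matrix of pairwise distances as in the sketch below. The layout is an assumption of this example: `dist` is a full symmetric matrix in which the first `n1` indices are initial signals and the remaining `n2` indices are modified signals, and each unordered pair is counted once. The function name and dictionary keys are hypothetical.

```python
def area_statistics(dist, n1, n2):
    """Average, minimum and maximum distances over areas S1, S2 and S3."""
    if n1 < 2:
        raise ValueError("at least two initial signals are required")
    total = n1 + n2
    # S1: initial vs. modified; S2: initial vs. initial; S3: modified vs. modified.
    s1 = [dist[i][j] for i in range(n1) for j in range(n1, total)]
    s2 = [dist[i][j] for i in range(n1) for j in range(i + 1, n1)]
    s3 = [dist[i][j] for i in range(n1, total) for j in range(i + 1, total)]
    return {
        "l1_mean": sum(s1) / (n1 * n2),                        # formula (6)
        "l2_mean": 2 * sum(s2) / (n1 * (n1 - 1)),              # formula (7)
        "l3_mean": 2 * (sum(s2) + sum(s3))
                   / (n1 * (n1 - 1) + n2 * (n2 - 1)),          # formula (8)
        "l1_min": min(s1),                                     # formula (9)
        "l2_max": max(s2),                                     # formula (10)
    }
```

The factor 2 in (7) and (8) appears because an area with n signals contains n(n-1)/2 unordered pairs.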

These values are used to form the criteria of phoneme pronunciation quality. The following criteria are proposed:

1. The ratio of the average distance between the initial and modified signals to the average distance between initial signals:
$$\begin{aligned} Cr_1=\overline{l_1}/ \overline{l_2}. \end{aligned}$$
(11)
The closer this value is to 1, the closer the modified signal is to the initial one, and vice versa. A shortcoming is that the final value can be almost completely determined by large outlying values.
2. The ratio of the average distance between the initial and modified signals to the average distance between signals of one type:
$$\begin{aligned} Cr_2=\overline{l_1}/ \overline{l_3}. \end{aligned}$$
(12)
The closer this value is to 1, the closer the modified signal is to the initial one, and vice versa. A shortcoming is that the final value can be almost completely determined by large outlying values. Using the differences between the modified signals is doubtful because of their lower stability and, as a result, large distance values. However, the applicability of this criterion requires additional practical verification.
3. The ratio of the minimum difference between the initial and modified signals to the maximum difference between initial signals:
$$\begin{aligned} Cr_3=l_{1min}/l_{2max}. \end{aligned}$$
(13)
If this value is greater than 1, the distance values for the two classes clearly do not overlap, and the farther apart they are, the better the criterion. In reality, however, this estimate is determined by extreme values that fall outside the general set, and in practice it will almost always be less than 1. Then, on the one hand, the closer this value is to 1, the smaller the overlap between the sets of distance values for the initial and modified signals. On the other hand, it does not account for the number or share of signals falling into this overlap.
4. The share of pairs of initial signals whose distance exceeds the minimum distance between pairs of initial and modified signals, and the share of pairs of initial and modified signals whose distance is less than the maximum distance between pairs of initial signals:
$$\begin{aligned} Cr_{41}=\frac{2 \cdot \operatorname{count}_{i,j \in S_2} (l_{ij}>l_{1min})}{n_1(n_1-1)}, \end{aligned}$$
(14)

$$\begin{aligned} Cr_{42}=\frac{\operatorname{count}_{i,j \in S_1} (l_{ij}<l_{2max})}{n_1n_2}. \end{aligned}$$
(15)
Ideally both shares should equal 0. In practice, the smaller the value, the better, because the overlap caused by the pairs of signals falling into this area is smoothed out by the averaging in criterion 1 (provided that there are few such pairs).
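The criteria (11)-(15) can be sketched as follows. As an assumption of this example, the distances of each area are passed as flat lists `s1`, `s2` and `s3` that contain every unordered pair exactly once; under that assumption the plain means and shares below coincide with formulas (6)-(8), (14) and (15). The function name and dictionary keys are hypothetical.

```python
def quality_criteria(s1, s2, s3):
    """Criteria Cr1, Cr2, Cr3, Cr41 and Cr42 from per-area distance lists.

    s1: distances between initial and modified signals (area S1)
    s2: distances between initial signals (area S2)
    s3: distances between modified signals (area S3)
    """
    l1_mean = sum(s1) / len(s1)
    l2_mean = sum(s2) / len(s2)
    l3_mean = (sum(s2) + sum(s3)) / (len(s2) + len(s3))
    l1_min = min(s1)
    l2_max = max(s2)
    return {
        "Cr1": l1_mean / l2_mean,
        "Cr2": l1_mean / l3_mean,
        "Cr3": l1_min / l2_max,
        # Share of initial pairs farther apart than the closest mixed pair.
        "Cr41": sum(1 for d in s2 if d > l1_min) / len(s2),
        # Share of mixed pairs closer than the farthest initial pair.
        "Cr42": sum(1 for d in s1 if d < l2_max) / len(s1),
    }
```

When the two sets of distances do not overlap at all, `Cr3` exceeds 1 and both shares are 0.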

7 Analysis of the Signals of Healthy Speakers Using the Proposed Approach

At this stage the study was conducted on recordings of one male and one female speaker. Each speaker made 10 recordings; the first 5 and the last 5 syllables differed, but they contained the same phoneme in the same position within the syllable.

Further comparisons were made for each speaker individually. Comparing different speakers with each other is obviously less important, because it is problematic to account simultaneously for several factors that change pronunciation (the specific speaker and the condition of the speech production tract).

Figure 3 shows the values of the obtained quality criteria over the whole range of studied parameter values with a step of 1 (left half). The values of the criteria are also shown for the most characteristic part of the range, selected from the results of the previous stage of the experiment (right half).

As a result, the Minkowskian distance is most appropriate when the parameter p is between 1.6 and 3.1. Moreover, Fig. 4 shows a similar dependence obtained for the upper half of the spectrum (3–6 kHz).

The results showed that at this stage the considered range of values is sufficient to preselect the Minkowskian distance parameter and does not require further expansion. The usefulness of restricting the analysis to the upper frequency range, which was potentially more informative according to the preliminary experiments, has not been confirmed and requires further research.

A preliminary experiment on recordings of real patients was conducted to confirm the results. The findings appear to be correct, but there is not enough space in this article for a detailed description, which will be given in subsequent publications. In addition, the next step will be to use mel-cepstral coefficients [10], linear prediction coefficients [11] and autocorrelation [12] for the evaluation of speech quality. Automating the segmentation of syllables into phonemes is also a problem for the next stage of research [13].

8 Conclusion

This work presents the results of one phase of forming criteria for evaluating the quality of phoneme pronunciation by patients in the process of speech rehabilitation after surgery for cancer of the oral cavity and oropharynx. Criteria based on the Minkowskian distance between the normalized spectra of defective phonemes were formed. A preliminary value of the distance parameter for the most informative criterion was selected. The proposed method of assessing the pronunciation quality of the phonemes [t] and [k] was tested on recordings of real patients. The tasks for the next phase of the study were set. This work is one part of the larger task of assessing speech quality during speech rehabilitation.

References

1. Kaprin, A.D., Starinskiy, V.V., Petrova, G.V.: Status of cancer care the population of Russia in 2014. MNIOI name of P.A. Herzen, Moscow (2015)
2. Kaprin, A.D., Starinskiy, V.V., Petrova, G.V.: Malignancies in Russia in 2014 (Morbidity and mortality). MNIOI name of P.A. Herzen, Moscow (2015)
3. Standard GOST R 50840–95 Voice over paths of communication. Methods for assessing the quality, legibility and recognition. Publishing Standards, Moscow (1995)
4. Balatskaya, L.N., Choinzonov, E.L., Chizevskaya, S.Y., Kostyuchenko, E.U., Meshcheryakov, R.V.: Software for assessing voice quality in rehabilitation of patients after surgical treatment of cancer of oral cavity, oropharynx and upper jaw. In: Železný, M., Habernal, I., Ronzhin, A. (eds.) SPECOM 2013. LNCS, vol. 8113, pp. 294–301. Springer, Heidelberg (2013)
5. Kostyuchenko, E.Y., Mescheryakov, R.V., Balatskaya, L.N., Choynzonov, E.L.: Structure and database of software for speech quality and intelligibility assessment in the process of rehabilitation after surgery in the treatment of cancers of the oral cavity and oropharynx, maxillofacial area. SPIIRAS Proc. 32, 116–124 (2014)
6. MedFind. Oncology. Plastic surgery in the surgical treatment of tumors of the face, jaws. http://medfind.ru/modules/sections/index.php?op=viewarticle&artid=324
7. Kim, D.O., Myuller, C.U., Klekka, U.R.: Factorial, Discriminant and Cluster Analysis. Finance and Statistics, Moscow (1989)
8. Sergienko, A.B.: Digital Signal Processing. Peter, St. Petersburg (2006)
9. Max, J.: Methods and signal processing equipment for physical measurements. In: 2 vols, Translation from French. Mir, Moscow (1983)
10. Rabiner, L.R., Schafer, R.W.: Introduction to Digital Speech Processing. Foundations and Trends in Signal Processing (2007)
11. Benesty, J., Sondhi, M.M., Huang, Y. (eds.): Springer Handbook of Speech Processing. Springer, Heidelberg (2008)
12. Shuyin, Z., Ying, G., Buhong, W.: Auto-correlation property of speech and its application in voice activity detection. In: First International Workshop on Education Technology and Computer Science. ETCS 2009, pp. 265–268 (2009)
13. Gold, K., Scassellati, B.: Audio speech segmentation without language-specific knowledge. In: Cognitive Science, pp. 1370–1375 (2006)

Acknowledgments

The study was supported by a grant from the Russian Science Foundation (project 16-15-00038).

Author information

Authors and Affiliations

Tomsk State University of Control Systems and Radioelectronics, Lenina str. 40, 634050, Tomsk, Russia
Evgeny Kostyuchenko, Mescheryakov Roman, Dariya Ignatieva & Alexander Pyatkov
Tomsk Cancer Research Institute, Kooperativniy av. 5, 634050, Tomsk, Russia
Evgeny Choynzonov & Lidiya Balatskaya

Corresponding author

Correspondence to Evgeny Kostyuchenko.

Editor information

Editors and Affiliations

SPIIRAS, Saint-Petersburg, Russia
Andrey Ronzhin
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova
Budapest University of Technology and Economics, Budapest, Hungary
Géza Németh

Cite this paper

Kostyuchenko, E., Roman, M., Ignatieva, D., Pyatkov, A., Choynzonov, E., Balatskaya, L. (2016). Evaluation of the Speech Quality During Rehabilitation After Surgical Treatment of the Cancer of Oral Cavity and Oropharynx Based on a Comparison of the Fourier Spectra. In: Ronzhin, A., Potapova, R., Németh, G. (eds) Speech and Computer. SPECOM 2016. Lecture Notes in Computer Science, vol 9811. Springer, Cham. https://doi.org/10.1007/978-3-319-43958-7_34

DOI: https://doi.org/10.1007/978-3-319-43958-7_34
Published: 13 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43957-0
Online ISBN: 978-3-319-43958-7
eBook Packages: Computer Science (R0)
