1 Introduction

In human-human interaction (HHI) the behavior of the speaker is mainly characterized by semantic and prosodic cues such as short feedback signals. These signals further the progress and coordination of the interaction. They transmit so-called meta-information about certain dialogue functions such as attention, understanding, confirmation or other attitudinal reactions [1]. Discourse particles (DPs) such as “ja”, “so”, “wie” or “hm” belong to these feedback signals and serve as independent small utterance units, occurring at communicatively decisive points of the conversation without interrupting the speaker [16]. As these signals carry very little semantic content, special attention needs to be paid to their prosody. By changing the intonation, a DP can take on several different meanings; “hm” is one of the most diverse DPs in this respect. For the German language, Schmidt [12] identified seven form-function-relations for the isolated speech signal of the DP “hm”. They describe the relation between the functional-meaning of the DP and specific pitch-contours, see Table 1.

Table 1. Form-function relation of the DP “hm” according to [12]; the terms are translated into appropriate English ones.

Furthermore, DPs are verifiably used in both HHI and human-computer interaction (HCI) [5, 12, 15], and specific form-function-relations could be confirmed [8, 9]. Influences of specific subject characteristics, such as age and biological gender, as well as of certain personality traits on the use of DPs were also uncovered [6, 14].

To obtain a more human-like interaction with a technical system, the system needs to be capable of understanding these feedback signals and reacting appropriately. This enables a more natural interaction, and the system thus becomes the user’s attendant and ultimately their companion [2].

To achieve this aim, we developed a classifier to distinguish between the different functional-meanings of the DP “hm” [8]. The classifier is trained on speech material from HCI. In this paper we describe the adaptation of this classifier to an HHI corpus and show that it is possible to classify non-isolated DPs from HHI. Furthermore, we show that one of the seven prototypes obtained by Schmidt can be assigned by applying only small changes to the existing algorithm. Compared to the results for HCI, we expect an increase of partner-oriented feedback signals in HHI [5].

The remainder of the paper is structured as follows: Sect. 2 briefly describes the utilized datasets, which provide HHI as well as HCI. Then, Sect. 3 describes the DP-classifier developed on the HCI data in detail. Afterwards, Sect. 4 describes the conducted manual labeling process as well as the adaptation of the classifier based on the newly obtained HHI data. Section 5 presents and discusses the results of the improved DP classification algorithm and analyses the manual labeling. Finally, Sect. 6 concludes the paper and provides an outlook on further research directions.

2 Datasets

The LAST MINUTE Corpus (LMC) contains 130 high-quality multi-modal recordings of German-speaking subjects during Wizard-of-Oz experiments [10, 11]. The setup of the HCI revolves around an imaginary journey to the unknown place Waiuku. With the help of an adaptable technical system, the subjects have to prepare the journey by packing a suitcase and selecting clothing and other equipment using voice commands. Each experiment takes about 30 min. For a sub-set of the corpus, a total of 259 DPs were extracted from 25 h of speech material recorded from 56 subjects. They serve as training data for the basic classification algorithm presented in [8].

The ALICO Corpus was recorded to investigate changes in feedback behavior in HHI when the listener is distracted [4]. The corpus consists of \(2 \times 25\) German-language dialogues in which one person tells the other participant two stories. In the first talk the listener is instructed to pay attention, make remarks and ask questions. In the second one an additional distraction task is given to the listener, who is instructed to press a button on a hidden counter every time the story teller utters a word starting with the letter ‘s’. A sub-set of 40 dialogues is annotated, resulting in 1505 feedback signals from 3 h of speech material. Out of these signals 537 are marked as “hm” and used for the investigation presented in this paper.

Fig. 1. Structural flowchart of the classification-algorithm

3 The Discourse Particle Classifier

The DP-classifier uses a rule-based approach. Before the classification a pre-processing step is conducted, see Fig. 1. The subsequent function analysis is based on a regression calculation on the pitch-contour. Depending on certain threshold values, the classifier decides to which form-prototype the given pitch-contour belongs. The thresholds were set in preliminary investigations. During these investigations, two more frequently occurring formtypes were identified in addition to the seven form-function-prototypes of Schmidt, see Table 2. No information about the functional use of these formtypes is given, as the work [8] only deals with the course of the pitch-contour and not with the identification of its meaning.

Table 2. Additional form types as stated in [8].

Depending on the coefficient of determination (\(R^2\)) either a first, second or third order regression is performed. If the coefficient exceeds the threshold value:

$$\begin{aligned} TH_{determination} = 0.9 \end{aligned}$$
(1)

it is assumed that the regression function describes the original pitch-contour sufficiently. The tendency of the contour (horizontal, rising or falling) is determined solely from the slope of the first order regression line. This allows the classifier to exclude some prototypes from further assignment. For example, if the first order regression line has a rising tendency, all prototypes with a horizontal (DP-2) or falling tendency (DP-1, DP-3, DP-5 or DP-8) can be excluded. This leaves only the prototypes DP-4, DP-6, DP-7 and DP-9 for the later assignment.
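The order selection and the pruning of prototypes by tendency can be sketched as follows. This is a minimal illustration in Python with NumPy, not the implementation from [8]: the function names, the use of `np.polyfit`, and returning `None` when no order reaches the threshold are assumptions. The prototype groups are the ones named in this section.

```python
import numpy as np

TH_DETERMINATION = 0.9  # threshold on R^2, Eq. (1)

# Prototype groups as listed in this section.
BY_TENDENCY = {
    "horizontal": {"DP-2"},
    "falling": {"DP-1", "DP-3", "DP-5", "DP-8"},
    "rising": {"DP-4", "DP-6", "DP-7", "DP-9"},
}
BY_ORDER = {
    1: {"DP-2", "DP-3", "DP-7"},          # linear
    2: {"DP-4", "DP-5", "DP-8", "DP-9"},  # quadratic
    3: {"DP-1", "DP-6"},                  # cubic
}


def r_squared(y, y_fit):
    """Coefficient of determination of a fit (1.0 for a constant signal)."""
    ss_res = float(np.sum((y - y_fit) ** 2))
    ss_tot = float(np.sum((y - np.mean(y)) ** 2))
    return 1.0 if ss_tot == 0.0 else 1.0 - ss_res / ss_tot


def select_regression_order(t, f0, max_order=3):
    """Return (order, coeffs, r2) for the lowest order whose R^2 exceeds
    the threshold, or (None, None, None) if none does (an assumption)."""
    t = np.asarray(t, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    for order in range(1, max_order + 1):
        coeffs = np.polyfit(t, f0, order)
        r2 = r_squared(f0, np.polyval(coeffs, t))
        if r2 > TH_DETERMINATION:
            return order, coeffs, r2
    return None, None, None


def candidates(order, tendency):
    """Prototypes compatible with both the regression order and the tendency."""
    return BY_ORDER[order] & BY_TENDENCY[tendency]
```

For a contour that is well described by a straight line, `candidates(1, "rising")` leaves a single prototype, which matches the unambiguous assignment of linear contours described in the text.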

Considering the order of regression, the prototype-courses can be divided into three types: linear (DP-2, DP-3, DP-7), quadratic (DP-4, DP-5, DP-8, DP-9) and cubic ones (DP-1, DP-6). If a pitch-contour is sufficiently described by a linear regression function, it can be unambiguously assigned to a linear prototype by looking at its course tendency. For contours described by a higher order regression, the course is divided into sections delimited by its turning points. The prototypes are then assigned based on the timeratio and slope of these sections. The thresholds for these parameters are:

$$\begin{aligned} TH_{timeratio} = 1/3 \end{aligned}$$
(2)
$$\begin{aligned} TH_{slope} = 40 \text { Hz/s} \end{aligned}$$
(3)

If two consecutive sections exceed \(TH_{timeratio}\), both sections will be taken into account for further assignment. Otherwise, only the dominant part (longer section) will be considered. For \(TH_{slope}\), the courses (linear regression) and sections (higher order regression) are assumed to be horizontal if the absolute slope is below the threshold and rising or falling for slopes exceeding the threshold, respectively.
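The section handling for higher order fits can be sketched as below. The sketch makes several assumptions that the text does not spell out: the timeratio of a section is taken as its share of the total contour duration, the slope of a section is taken as the secant slope between its end points, and all function names are hypothetical.

```python
import numpy as np

TH_TIMERATIO = 1 / 3  # Eq. (2)
TH_SLOPE = 40.0       # Hz/s, Eq. (3)


def tendency(slope):
    """Map a slope in Hz/s to horizontal, rising or falling."""
    if abs(slope) < TH_SLOPE:
        return "horizontal"
    return "rising" if slope > 0 else "falling"


def split_sections(coeffs, t_start, t_end):
    """Split a fitted polynomial at its turning points inside the contour."""
    poly = np.poly1d(coeffs)
    turning = sorted(
        float(r.real)
        for r in poly.deriv().roots
        if abs(r.imag) < 1e-9 and t_start < r.real < t_end
    )
    bounds = [t_start] + turning + [t_end]
    sections = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        if b <= a:
            continue  # skip degenerate sections
        slope = (poly(b) - poly(a)) / (b - a)  # secant slope (assumption)
        sections.append({
            "timeratio": (b - a) / (t_end - t_start),
            "tendency": tendency(slope),
        })
    return sections


def relevant_sections(sections):
    """Keep consecutive sections that both exceed TH_TIMERATIO,
    otherwise only the dominant (longest) section."""
    keep = set()
    for i in range(len(sections) - 1):
        if (sections[i]["timeratio"] > TH_TIMERATIO
                and sections[i + 1]["timeratio"] > TH_TIMERATIO):
            keep.update((i, i + 1))
    if not keep:
        keep = {max(range(len(sections)),
                    key=lambda i: sections[i]["timeratio"])}
    return [sections[i] for i in sorted(keep)]
```

A symmetric fall-rise parabola yields two sections of equal length that are both kept, whereas a contour whose turning point lies close to its end is reduced to its dominant falling part.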

As the investigation is implemented on an HCI corpus, most DP samples express talk-organizing or expressive functions. These are mainly DPs of type DP-2. Therefore, an exception in the algorithm is made for regression lines with horizontal tendencies. Disregarding \(TH_{determination}\), the type DP-2 is assigned if the standard deviation of the original pitch-contour is less than 7 Hz. Variations of this size can be neglected, as the human ear is not able to perceive them [17]. More information on the classification algorithm, especially on the pre-processing of the original pitch-contours, is given in [8].
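The DP-2 exception amounts to a simple check before the regular assignment. The following sketch assumes that the horizontal tendency is tested with \(TH_{slope}\) on the first order regression line and that the population standard deviation is used; neither detail is stated in the text, and the function name is hypothetical.

```python
import numpy as np

TH_SLOPE = 40.0  # Hz/s, Eq. (3)
TH_STD = 7.0     # Hz, variations below this are treated as inaudible


def is_flat_dp2(t, f0):
    """True if the contour qualifies for the DP-2 exception:
    horizontal first order regression line and standard deviation < 7 Hz."""
    t = np.asarray(t, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    slope = np.polyfit(t, f0, 1)[0]
    return bool(abs(slope) < TH_SLOPE and np.std(f0) < TH_STD)
```

Because the check ignores \(R^2\), a noisy but flat contour is still labeled DP-2 even though no regression order describes it well.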

To evaluate the implemented classifier, a manual annotation of the pitch-contours is performed. The classifier output agrees with the manual annotation for approx. 79 % of the DPs contained in the LMC HCI corpus.

4 Adaptation of the Classifier

As the existing classifier is only trained on the LMC HCI corpus, the amount of training material for partner-oriented prototypes is noticeably low, see row HCI in Table 4. As most of the prototypes detected in HCI are of type DP-2, DP-3 and DP-7, which have a linear course of the pitch-contour, contours described by higher order functions have not been verified. Another problem is the low number of DPs used to evaluate the classifier: not all pitch-contours occurring in HHI are considered in the original implementation, so that some prototypes cannot be assigned. Therefore, an adaptation of the classifier is needed.

4.1 Manual Labeling

To verify the results of the algorithm [8] and adapt the classifier to non-isolated DPs from HHI, we first carried out a manual labeling. The labeling was conducted in two stages: First, a visual assignment of the intonation curve, based on extracted pitch-values, to one of the seven idealized pitch-contours was performed. DP-8 and DP-9 were left unconsidered on purpose, as no functional-meaning of these types is known. We assumed that these formtypes are variants of the verified form-function-prototypes by Schmidt. Second, an acoustic annotation of the functional-meaning without knowledge of the pitch-contour was conducted (see Table 1). For the acoustic annotation, a modified version of ikannotate (cf. [3]) was used. To clarify the results stated in Sect. 5.2: for the acoustic annotation the labelers were not instructed to annotate the perceived intonation-curve but the functional-meaning according to the audible context in which the DPs occurred.

4.2 Adjustment of the Threshold Parameter

As already stated in our earlier investigation on DPs in HCI [7], one major problem of a classifier that relies on the intonation curve of the signals is pitch extraction. In our case the considered DP “hm” is an unvoiced utterance. This makes it hard to obtain a robust pitch estimate, as there is no stimulation of the vocal cords at the glottis. The pitch-extractor is therefore unable to estimate a continuous pitch-contour, which leads to gaps and/or jumps in the signal. To ensure the reliability of the classifier, we require that at least 50 % of the contour be available to the function analysis algorithm. For contours with a lower percentage, a correct assignment of the prototype’s functional-meaning by the classifier cannot be expected. Figure 2 depicts a frequently occurring mismatch of prototypes: while the extracted pitch-contour leads to a DP-3 (finalization signal) assignment, the complete contour would be assigned to DP-4 (confirmation). This requirement has no influence on the actual assignment of the prototypes, but there is no point in correctly assigning a formtype if the functional-meaning cannot be assigned correctly.
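The 50 % requirement can be expressed as a small filter on the extracted contour. The sketch assumes that frames without a pitch estimate are marked as NaN, which depends on the pitch-extractor in use; the function names are hypothetical.

```python
import numpy as np

MIN_AVAILABLE = 0.5  # at least 50 % of the contour must be present


def available_fraction(f0):
    """Share of frames with a valid pitch estimate (NaN marks a gap)."""
    f0 = np.asarray(f0, dtype=float)
    if f0.size == 0:
        return 0.0
    return np.count_nonzero(np.isfinite(f0)) / f0.size


def is_usable(f0):
    """True if enough of the contour is available for the function analysis."""
    return available_fraction(f0) >= MIN_AVAILABLE
```

Contours that fail this check would be excluded before the function analysis rather than assigned to a prototype.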

Fig. 2. Frequently occurring mismatch in prototype.

To optimize the performance of the classifier, the function analysis algorithm was first extended by a new threshold, \(TH_{sloperatio}\). When pitch-contours are described by higher order regression functions, not only the timeratio but also the sloperatio of two consecutive sections is now considered for the assignment of the prototypes. This leads to a better differentiation between linear and higher polynomial (quadratic and cubic) prototypes (cf. Sect. 3), which were rarely considered in the earlier investigation on DPs in HCI [8]. Furthermore, the thresholds were adjusted:

$$\begin{aligned} TH_{timeratio} = TH_{sloperatio} = 1/6 \end{aligned}$$
(4)

This is necessary to make the classifier more sensitive to higher polynomial prototypes.

5 Results

In this section the results of the original and optimized classifier are compared. The performance of the classifier is determined depending on the visual and acoustic annotation of the labelers. If not indicated differently, all results refer to the 537 extracted DPs obtained from the ALICO HHI Corpus.

5.1 Classification-Algorithm

In total, 537 DPs were given as input to the function analysis loop. In the pre-processing of the classifier, the 114 pitch-contours that are too short were discarded, leaving 423 DPs for the original classifier. For the optimized classifier, a further 177 contours were discarded because they fail the requirement from Sect. 4.2 that at least 50 % of the pitch-contour be available, leaving 246 DPs. The results can be found in Table 3.

Table 3. Frequency distribution of the results of the classification-algorithm. (a) original classifier (423 DPs) (b) adjusted classifier (246 DPs); additionally used labels: NSP “no specification possible”

In both cases (original and optimized classifier) a high number of DP-2 was recognized. The original classifier also assigned linear courses of type DP-3 and DP-7. This can be explained by the development of the rule-based classifier on an HCI corpus that includes mostly linear prototypes. The optimization of the classifier led to a decrease of these linear types and an increase of quadratic and cubic prototypes of type DP-4, DP-6 and DP-8, as assumed and as supported by the results in Sect. 5.3. In comparison to HCI (cf. Table 4), we obtained a remarkable number of detected DP-4, which corresponds to the assumption made in the introduction that the number of partner-oriented feedback signals increases in HHI. This is to be expected, as the style of the dialogue is more interview-like, leading the listener to use feedback signals such as DPs to express confirmation.

Table 4. Frequency distribution of the results of the original classifier comparing HCI (LMC corpus) with HHI (ALICO corpus)

5.2 Manual Labeling

The manual labeling was assessed using majority voting. For both assignments (visual and acoustic) 10 labelers were consulted. Then a majority voting over the results of the 5 labelers with the highest inter-rater reliability (Krippendorff’s \(\alpha _{visual} = 0.64 \text { and } \alpha _{acoustic} = 0.16\)) was carried out. The low reliability of the acoustic labeling is due to the many degrees of freedom the labelers had, as no further restrictions were given except for the functional-meanings. Some of the meanings defined by Schmidt are quite similar and hard to distinguish acoustically. Low values of \(\alpha _{acoustic}\) are rather typical across several databases, as was shown in [13] for various emotional assessments.
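The voting step can be sketched as follows. Whether a relative or an absolute majority was required is not stated; the sketch assumes that a label needs more than half of the votes and otherwise returns the label NM ("no majority") used in Table 5. The function name is hypothetical.

```python
from collections import Counter


def majority_vote(labels, no_majority="NM"):
    """Return the label chosen by more than half of the labelers,
    or `no_majority` if no such label exists (assumed rule)."""
    if not labels:
        return no_majority
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else no_majority
```

With five labelers, three identical votes are enough for a decision, while a 2-2-1 split yields NM.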

The results of the majority voting are stated in Table 5. Considering the inter-rater reliability and the results of the majority voting, we can already see that it is hard for all labelers to agree on one functional-meaning. The low inter-rater reliability of the annotation of the functional-meaning can also be explained by the way Krippendorff’s \(\alpha \) is calculated. As the majority of the data is rated as DP-4, the agreement expected by chance is already high, so every single mismatch is treated as unlikely and weighs heavily on the reliability value.
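For reference, Krippendorff’s \(\alpha \) for nominal labels can be computed from a coincidence matrix as \(\alpha = 1 - D_o / D_e\). The sketch below is a generic textbook formulation, not the tool used for the values reported above; the function name and the input layout (one list of labels per rated item) are assumptions.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` holds one list of labels per rated item; items with fewer
    than two labels are ignored because they cannot be paired.
    """
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        # every ordered pair of ratings within an item, weighted by 1/(m-1)
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b)
    if expected == 0:
        return 1.0  # only one category in use; returning 1.0 is a convention
    return 1.0 - (n - 1) * observed / expected
```

The denominator shrinks when one category dominates, which is why a few mismatches on data rated mostly as DP-4 can pull \(\alpha \) down sharply.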

Table 5. ALICO data: Frequency distribution of the majority voting. (a) visual assignment of the idealized pitch-contours (b) acoustic annotation of the functional-meaning; additionally used labels: NSP “no specification possible” and NM “no majority”

5.3 Pitch-Contour Assessment

For the manual labeling process DP-8 and DP-9 were left unconsidered, as no statement about the functional-meaning of these prototypes is given. The pitch-contours assigned to DP-8 or DP-9 by the classifier will, accordingly, never match the manually annotated labels and are therefore not taken into account when calculating the performance of the classifier with respect to the visual annotation results. Without any adaptation of the function analysis algorithm or threshold parameters, the original classifier correctly identified 70.5 % of all DPs in non-isolated HHI. Optimizing the classifier raised this value to 74.5 %.

Table 6. Confusion matrix of the prototype assignment of the original classifier compared to the manual labeling; additionally used labels: NSP “no specification possible”

Table 6 shows the confusion matrix of the manual labeling of the formtypes and the results of the original classification algorithm. Most of the mismatches occur between DP-2 on the one hand and DP-3 or DP-7 on the other. In these cases the disagreement concerns only the slope of the course. It can be concluded that the labelers were more likely to identify the perceived signals as sloping than as horizontal.

Moreover, we recognized a high confusion between the formtypes DP-7 and DP-4, illustrated in Fig. 3. This also explains the unexpectedly high number of DP-7 in the visual annotation of the pitch-contours. In these cases the labelers were not sure which prototype to choose. Both cases of mismatch were minimized by the adjustment of the classification algorithm described in Sect. 4.2. For the optimized classifier no clear majority in the mismatches of prototype assignment was recognized.

Fig. 3. Illustration of the confusion of real signals between the prototypes of DP-4 and DP-7.

5.4 Functional-Meaning Assessment

We now consider the results of the acoustic functional-meaning assessment. In the results stated in Table 5 (b) a clear majority of DP-4 (confirmation) was recognized. In comparison, the results of the pitch-contour assignment showed a distribution over all considered prototypes. The high deviation between both assignments clearly leads to a low performance of the classifier concerning the functional-meaning of the DPs. Considering only the labelers’ annotation results, the visual and acoustic assignments are consistent in 28.45 % of all cases. The classifier obtained a slightly higher performance of 33.49 %. This makes it impossible for the user to get a reliable statement on the functional-meaning of the pitch-contours. One reason for this low agreement is the number of overlaps contained in the DP audio-samples. To still be able to rate the performance of the classifier for the ALICO Corpus, a new approach is introduced. As the annotation of the functional-meaning has a strong majority of DP-4, we reduced the classification problem from a seven class problem to a two class problem containing the classes “DP-4” and “other DP”. The confusion table of this new classification problem is depicted in Table 7, resulting in a (foreseeable) recall of 31.05 % and a (remarkable) precision of 90.67 %.

Table 7. Confusion table of the reduced 2-class classification problem. Recall = 31.05 %, precision = 90.67 %

This means that almost all prototypes assigned to type DP-4 are true confirmation signals. Furthermore, looking at the idealized pitch-contours of DP-6, DP-4, DP-8 and DP-9, we noticed that all of these prototypes can be represented within DP-6. This phenomenon is depicted in Fig. 4. As the functional-meanings of DP-6 and DP-4 have similar descriptions (confirmation and positive assessment \(\rightarrow \) positive feedback), we generalized these functional-meanings into one class, also containing DP-8 and DP-9. The acoustic evaluation of these formtypes also confirms this assumption. This led to an even higher recall of 51.10 % and precision of 95.08 %.
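The reduction to a two class problem and the resulting precision and recall can be sketched as follows. The label lists in the usage example are made up for illustration and are not taken from Table 7; treating the classifier output as the prediction and the acoustic annotation as the reference is an assumption about the direction of the evaluation, and the function names are hypothetical.

```python
POSITIVE_DP4 = {"DP-4"}                             # "DP-4" vs. "other DP"
POSITIVE_MERGED = {"DP-4", "DP-6", "DP-8", "DP-9"}  # generalized positive class


def to_binary(labels, positive):
    """Map prototype labels to True (positive class) or False (other DP)."""
    return [label in positive for label in labels]


def precision_recall(predicted, reference):
    """Precision and recall of boolean predictions against a boolean reference."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)
    fp = sum(1 for p, r in pairs if p and not r)
    fn = sum(1 for p, r in pairs if not p and r)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (hypothetical): classifier output vs. acoustic annotation.
classifier = ["DP-4", "DP-2", "DP-6", "DP-4"]
annotation = ["DP-4", "DP-4", "DP-4", "DP-2"]

strict = precision_recall(to_binary(classifier, POSITIVE_DP4),
                          to_binary(annotation, POSITIVE_DP4))
merged = precision_recall(to_binary(classifier, POSITIVE_MERGED),
                          to_binary(annotation, POSITIVE_MERGED))
```

In this toy case, merging the prototypes turns a DP-6 output on a DP-4 annotation from a miss into a hit, which is the mechanism behind the higher recall of the generalized class.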

Fig. 4. Comparison of pitch-contours with similar functional-meanings

For the LMC HCI dataset we obtained a higher match (cf. Table 8) with the acoustic labeling. Only the ratios of the assignments were available and not the direct assignments of the data-samples. Nevertheless, a high agreement in the frequency of occurrence of DP-2 and DP-4 is recognized. We assume that a direct assignment of the data-samples would also lead to a better match of pitch-contour and functional-meaning compared to the results of the seven class problem presented earlier in this section. This can be explained by the style of interaction: the computer almost never interrupts the speaker. This leads to a low number of overlaps in the DP-samples and an almost “isolated” occurrence of the pitch-contours.

Table 8. LMC data: Frequency distribution of the (a) classifier (b) acoustic annotation of the functional-meaning; additionally used labels: NSP “no specification possible”

6 Conclusion and Discussion

In conclusion, a high congruence between the visual manual labeling and the DP-classifier for pitch-contours could be shown (74.53 %). Thus, applying HCI-trained classifiers to HHI data is viable, and our classifier can be used to robustly evaluate the contours of DPs in both HCI and HHI. For the acoustic evaluation of the functional-meaning, the idealized form-function prototypes by Schmidt are not suitable in the case of naturalistic HHI. For the given ALICO dataset it was possible to reduce the seven class classification problem to a two class problem, obtaining a foreseeable recall of 31.05 % and a remarkable precision of 90.67 %.

Furthermore, we can state that the given audio material of the ALICO corpus is not the most suitable dataset for this investigation. As the annotated feedback signals were not assigned to the different speakers, an overlap of the speakers is possible. This can lead to failures in the pitch-estimation, resulting in a high number of correct assignments of pitch-contours to the prototypes but no correlation with their functional-meaning. A mapping of the DPs to the speakers could lead to an increase of matches in the functional-meaning, as separate headset microphone recordings are available for both speaker and listener. Additionally, most of the DPs contained in the corpora were of type DP-4. To get a significant statement about the classification of the functional-meaning, the dataset should include a balanced distribution of all form-function-prototypes identified by Schmidt in an appropriate audio quality. As the investigation so far only deals with the classification of “isolated” DPs, i.e. without the surrounding content of the considered DPs, the desired dataset can be merged from different corpora of the same conversational style (HCI/HHI) and level of naturalness (naturalistic/acted).