1 Introduction

Conversational modeling has recently attracted considerable attention, spanning tasks such as intent classification [1,2,3], dialogue state tracking [4,5,6], and slot filling [3, 7, 8]. Such systems have been used to automate customer service in the insurance and public transportation industries [9], to provide information in healthcare [10], and to perform tasks such as legal case retrieval [11] and conversational recommendation [12].

Fig. 1  Dialogue Structure Induction (DSI): the structure (right) is induced from the set of N user-system dialogues (left), with the distinct user and system states as nodes, and transition probabilities as edges. Illustration based on SimDial [27]

Designing conversational agents, however, requires readily available annotated data. While companies often have access to an abundance of unlabeled dialogues, such as those exchanged between their customers and support agents, annotating them to develop conversational agents remains costly. Consequently, dialogue structure induction (DSI) aims to recover, without supervision, the latent conversational structure from a set of task-oriented user-agent dialogues. Figure 1 shows an example of such a graph, in which the nodes represent the distinct user and agent (system) dialogue states, and the edges denote the possible transitions between successive states and their probabilities. A dialogue structure can compactly summarize an entire collection of dialogues, providing companies with relevant insights about their customers and agents, and thus offering a solid starting point for designing conversational models.

In unsupervised DSI, utterances with similar conversational goals are first clustered into the same dialogue state, and the structure (transition probabilities from one state to another) can then be recovered either directly from the model’s weights or by counting the number of transitions between successive states. Earlier work extended hidden Markov models to infer conversational graphs [13,14,15]. More recently, neural end-to-end models, e.g., DVRNN [16] and SVRNN [17], jointly learn to encode utterances and assign them to dialogue states. Yet such neural models (i) represent utterances by only considering the preceding dialogue context, and (ii) require GPUs and tend to be slow at inducing dialogue structures, as they are trained with a computationally expensive next-turn decoding objective. Since DSI models embed the number of dialogue states in their architecture, they must be re-trained every time that number changes. It is thus important to induce dialogue structures efficiently, since in practice, users may need to experiment with different numbers of states to recover the optimal structure.

To address the weaknesses above, our work revisits and further builds upon the method of Gunasekara et al. [18, 19], which comprises two efficient steps: utterances are first encoded into vectors and subsequently clustered. However, Gunasekara et al. [18, 19] represent utterances as bag-of-words or skip-thought [20] vectors, which have been shown to perform poorly in semantic similarity tasks [21, 22], and do so without considering dialogue context. In this work, we first demonstrate that encoding utterances with powerful transformer-based sentence encoders instead already leads to improvements over recent joint models in terms of cluster metrics, while being orders of magnitude faster at inducing the dialogue structure.

Next, we propose a highly efficient strategy to embed both preceding and subsequent dialogue context into utterance vector representations, called ellodar (for “Efficiently Learnt Locally Dialogue Aware Representations”), which further boosts performance in terms of cluster metrics. We cluster ellodar’s representations to induce dialogue structure, and refer to the complete procedure as cellodar. Regarding the aforementioned limitations of existing works, cellodar (i) uses both preceding and subsequent context, (ii) can be trained on CPU within seconds, and thus (iii) makes determining the number of dialogue states up to four orders of magnitude faster than recent joint models.

To obtain dialogue-aware embeddings before clustering, ellodar draws inspiration from the CBOW and skip-gram (i.e., word2vec) models for learning word embeddings [23]: utterances with similar context windows, and context windows enclosing similar utterances, are represented closer to each other in the embedding space. ellodar is efficient (it trains within seconds on CPU) as it learns a linear transformation with a vector-to-vector regression training objective in the encoding space of a frozen pretrained encoder, exploiting a local, yet bidirectional, context window. By casting representation learning as vector-to-vector regression, ellodar avoids the computational overhead incurred by decoding objectives, such as those used for training the joint DVRNN and SVRNN models.

Extensive experiments on 10 task-oriented domains spanning the DSTC2 [24], CamRest676 [25, 26], SimDial [27] and Schema Guided dialogue [5] datasets show that cellodar yields absolute improvements over recently proposed joint methods of 7%–74% in standard cluster metrics while being 10 to \(10^4\) times faster.

1.1 Research objective and contributions

Our objective is not to outperform existing approaches by merely developing increasingly more complex models. Rather, our goal is to attain state-of-the-art performance while being highly efficient compute-wise, thereby making it feasible to induce dialogue structures in practice. More specifically, we aim to obtain a model that (i) outperforms the more complex joint models, i.e., DVRNN [16] and SVRNN [17], as measured by standard cluster metrics, and (ii) is sufficiently lightweight for inducing dialogue structures on accessible and cheap computing resources such as CPUs (rather than requiring GPUs like such joint models).

We summarize our contributions as follows:

  1. We revisit the cluster baseline proposed in [18, 19], and demonstrate that clustering utterances encoded by transformer-based sentence encoders [22, 28], rather than by bag-of-words or skip-thought vectors, already outperforms the recent joint models for DSI [16, 17] in terms of inducing the correct dialogue structure, while being orders of magnitude faster.

  2. We contribute ellodar, a highly efficient utterance representation learning approach that exploits local dialogue context to train linear transformations in the encoding space of a frozen sentence encoder using a vector-to-vector regression training objective. Clustering the ellodar representations (referred to as cellodar) is shown to outperform — by a large margin — the joint DVRNN and SVRNN models [16, 17] (while being orders of magnitude faster) as well as the improved transformer-based cluster baselines, on representative DSI datasets.

  3. Since there exists no common benchmark for DSI, we release our modified datasets, evaluation, and models, which we hope will spur future research in the unexplored DSI task.

2 Related work

We summarize previous research on the relatively unexplored task of unsupervised dialogue structure induction. There are many variations of this task, including both supervised and unsupervised statistical methods that learn structures based on dialogue acts, as discussed in Section 2.1. However, we specifically focus on unsupervised dialogue structure induction for task-oriented dialogues, for which Section 2.2 reviews recent joint models based on neural and variational approaches, and compares them to our proposed approach. In addition, we discuss methods for structure learning based on unsupervised slot extraction in Section 2.3, which is a related but distinct task. Finally, Section 2.4 outlines the various applications for which dialogue structures have been used.

2.1 Unsupervised dialogue act modeling

Early work focused on structure modeling of dialogues based on categorizing utterances into high-level dialogue acts (e.g., question, statement, request, and acknowledgment) and then learning the structure (transitions) among these acts (states). In [29], utterances are manually annotated with dialogue acts, and the general discourse structure is then inferred using stochastic grammars. Since labeling dialogue text thus requires expensive annotation, the focus shifted to unsupervised dialogue act learning. Crook et al. [30] use Dirichlet Process Mixtures to cluster utterances into dialogue acts, but their approach does not model structural information that captures transitions between different acts. Therefore, to both model acts and learn the structure among them, Ritter et al. [14] combine hidden Markov and topic models to identify general discourse structure (i.e., dialogue acts) and dialogue-specific topics in non-task-oriented conversations. Joty et al. [31] further improve the approach of [14] by expanding the set of sentence features used to estimate the hidden Markov model’s act emission distribution to include the speaker, relative position, and sentence length in addition to unigrams. Similarly, the method of [32] uses hidden Markov models to model structural dependencies between dialogue acts, but instead estimates the act emission probabilities using Gaussian mixtures, enabling the use of real-valued sentence embeddings, such as bag-of-words GloVe vectors, to represent utterances, as opposed to discrete features [14, 31] such as unigrams and utterance length.

2.2 Unsupervised task-oriented dialogue structure induction

In contrast to the aforementioned works on identifying high-level dialogue acts, another line of work focuses on modeling dialogue structures in task-oriented domains, with the aim of categorizing utterances into more fine-grained, task-specific intents. Early approaches, such as those of [13, 15], adopt hidden Markov models (HMMs) to cluster text spans in task-oriented dialogues into states and learn the dependencies between them. Zhai et al. [15] follow an approach similar to the above-cited [14], but consider task-oriented dialogues, assuming that utterance words are generated from a mixture of topic models shared across all states rather than from a single model per state.

To better capture the highly non-linear dynamics in dialogues [33], recent solutions have shifted away from simple HMMs towards neural end-to-end models that jointly learn to encode and cluster utterances to induce task-oriented dialogue structures. Shi et al. [16] propose the use of Discrete Variational Recurrent Neural Networks (DVRNNs) to assign turns to discrete latent states, decoding the current turn from its predicted state and the preceding turns. Qiu et al. [17] extend DVRNNs to SVRNNs by adding structured attention [34] over its hidden states, enforcing a structural inductive bias that is more aligned with DSI. The work in [35] proposes a modification of the DVRNN model that separates user and system utterances instead of treating them jointly, leading to more accurate assignment of system actions to states. However, the approach of [35] relies on weak supervision from database queries performed by a human at some point in the dialogue, whereas the unsupervised DVRNN and SVRNN models do not require such (weak) supervision. Rather than inducing dialogue structures in task-oriented domains, Xu et al. [36] induce them in an open-domain setting, using a combination of discrete variational models with graph neural networks to hierarchically discover different domains and then learn the structure within each domain. To obtain more easily interpretable structures, Sun et al. [37] propose an Edge-Enhanced Graph Auto-Encoder that induces deterministic dialogue structures.

Our work focuses on unsupervised induction of non-deterministic dialogue structures in task-oriented domains, given that transitions between dialogue states are inherently probabilistic. We thus focus on the same task as the DVRNN [16] and SVRNN [17] models that jointly learn to encode and cluster utterances. However, both those models (i) only consider preceding dialogue context and, because they are based on Variational Auto-Encoders optimized with a next-turn decoding objective, they (ii) are slow to train, and (iii) are susceptible to posterior collapse [38,39,40]. Posterior collapse occurs when the model relies solely on the decoder’s auto-regressive properties to reconstruct inputs, thus bypassing the latent states altogether, which may result in utterances with distinct conversational goals being erroneously assigned to the same state.

To address these limitations (i)–(iii), our work builds on the method of [18, 19] that comprises two efficient steps: utterances are first (1) encoded as vectors and then (2) clustered into dialogue states (e.g., using k-means). Clustering assigns utterances to states based on vector similarities rather than on an indirect decoding objective. However, the methods used in [18, 19] for representing utterances as vectors, such as bag-of-words and skip-thought vectors, are sub-optimal for semantic similarity tasks [21, 22]. Furthermore, since these bag-of-words or skip-thought vectors are not fine-tuned on task-specific dialogues, the approach of [18, 19] does not utilize dialogue context. Here, we first experiment with using more powerful transformer-based encoders like SBERT [22] and TOD-BERT [28] that are better suited for semantic similarity tasks. Then, we propose ellodar as a method for obtaining task-specific contextual utterance representations by building upon an already pretrained transformer encoder, which is kept frozen, and subsequently learning a linear transformation on top of it with a vector-to-vector regression objective, using both preceding and subsequent context.

2.3 Unsupervised dialogue slot extraction

Similar to our current work, the methods discussed in Sections 2.1 and 2.2 induce dialogue structures by mapping utterances to states. In the related but different slot-based dialogue structure induction task, words or subphrases, rather than utterances, are mapped to states in task-oriented domains. To this end, Hudeček et al. [41] use weak supervision from rule-based parsers to identify potential slot candidates, which are then clustered into task-specific slots. Qiu et al. [42] employ transfer learning instead, using supervision from domains with available slot annotations to first train a model that detects slot boundaries. The obtained slot boundary detection model is then applied to unseen domains to identify slot candidates, which are subsequently clustered into states. Vukovic et al. [43] extend the transfer learning method of [42] by starting from the same slot boundary detection model, but using topological data analysis methods to increase the recall of the candidate slot extraction step. Rather than extracting slots through weak supervision or transfer learning, the method of [44] extracts slots in a completely unsupervised manner, using self-supervised language models trained on the task-specific dialogues and unsupervised parsers to identify slot candidates, after which these are similarly clustered to obtain slot states.

2.4 Applications of dialogue structures

While in our current paper, we solely focus on structure induction as an information extraction task, the inferred dialogue structure may be further used for other applications. In particular, it can be used for (i) accelerating dialogue policy learning [16, 45, 46], (ii) building more controllable and coherent dialogue agents, in open-domain [36, 47] and domain-specific settings [48], (iii) response generation in multi-party dialogues [49], (iv) low-resource dialogue state tracking [37], and (v) zero-shot policy learning that generalizes beyond a single domain [50].

3 Methodology

In Section 3.1, the DSI task is formalized. We specifically focus on recovering dialogue structures from task-oriented dialogues (Section 2.2), in which there are typically two parties who exchange utterances consecutively [5, 24,25,26,27, 51]. We will refer to the two parties in the dialogues as ‘users’ and ‘systems’ respectively, with the ‘system’ utterances generated by, e.g., a support agent in response to requests from a client (‘user’). We describe the cluster-based approach of [18, 19] in Section 3.2, followed by our proposed ellodar strategy to obtain utterance representations in Section 3.3.

3.1 Task formulation

We are given a set \(\mathcal {D}\) containing N dialogues between users and systems. Each dialogue \(d\in \mathcal {D}\) is a sequence of n utterances, alternating between user utterances \(x^{\textsc {u}}\) and system utterances \(x^{\textsc {s}}\) (or vice versa): \(\big [x_1^{\textsc {u}}, x_2^{\textsc {s}}, x_3^{\textsc {u}},\ldots , x_n^{\textsc {s}}\big ]\). Unsupervised dialogue structure induction aims to infer from \(\mathcal {D}\) the conversational graph (V, E) with vertices V and edges E. To this end, those utterances that have a common conversational goal (‘intent’) are mapped onto a common dialogue state \(v\in V\) across the corpus. User utterances \(x^{\textsc {u}}\) are mapped to a user dialogue state \(v\in ~V^{\textsc {u}}\), and system utterances \(x^{\textsc {s}}\) onto a system dialogue state \(v\in V^{\textsc {s}}\), whereby \(V=~V^{\textsc {u}}~\cup V^{\textsc {s}}\) and \(V^{\textsc {u}}~\cap V^{\textsc {s}} = \emptyset \). Assigning utterances to the correct state depends on the conversational context, such that two utterances with the same wording but in different dialogues may refer to different dialogue states. The edges \(e_{ij} \in E\) represent the probability \(p_{i,j}\) of transitioning from state \(v_i\) to \(v_j\) when following the conversation. Given the alternating user and system utterances in a dialogue, it is assumed that state transitions happen from a user to a system state or vice versa: \(\forall (v _{i}, v_{j}) \in ~V^{\textsc {u}} \times V^{\textsc {u}}: p_{v_{i},v_{j}} = 0\) (similar for \(V^{\textsc {s}}\)).

3.2 Cluster-based dialogue structure induction

We consider the cluster-based method of [18, 19], frequently adopted as a baseline for DSI, which encodes utterances as vectors and then clusters them into the \(|V^{\textsc {u}} |\) user and \(|V^{\textsc {s}} |\) system states. The transition probabilities \(p_{i,j}\) between states \(v_i, v_j \in V\) are computed by counting the number of utterances in \(v_i\) for which the following utterance is in \(v_j\), and then normalizing by the total number of utterances in \(v_i\):

$$\begin{aligned} p_{i,j} = \frac{\#(v_i \rightarrow v_j)}{\#v_i} \end{aligned}$$
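As a minimal illustration of this counting step (assuming utterances have already been assigned to states by the clustering step; all names below are illustrative rather than part of the released code), the transition probabilities can be estimated as follows:

```python
from collections import defaultdict

def transition_probabilities(dialogues):
    """Estimate p_{i,j} from state-annotated dialogues.

    `dialogues` is assumed to be a list of dialogues, each given as the sequence
    of state ids assigned to its consecutive utterances.
    """
    counts = defaultdict(lambda: defaultdict(int))  # counts[v_i][v_j] = #(v_i -> v_j)
    totals = defaultdict(int)                       # utterances in v_i that have a successor
    for states in dialogues:
        for v_i, v_j in zip(states, states[1:]):
            counts[v_i][v_j] += 1
            totals[v_i] += 1
    # normalize so that each state's outgoing probabilities sum to one
    return {v_i: {v_j: c / totals[v_i] for v_j, c in succ.items()}
            for v_i, succ in counts.items()}

# e.g., two toy dialogues over alternating user/system states
probs = transition_probabilities([["u0", "s0", "u1", "s0"], ["u0", "s0"]])
# probs["u0"]["s0"] == 1.0
```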

Works that compare against this cluster-based method (i) use sub-optimal embeddings and (ii) do not use dialogue context. In particular, only the current utterance is encoded, as a bag-of-words using GloVe [32, 52], word2vec [16, 23] or BERT [17, 37, 53]. Yet, such methods have been shown to produce sentence embeddings of low quality [21, 22]. Thus, we propose ellodar to efficiently learn locally dialogue-aware representations by using (i) more powerful transformer-based sentence encoders such as SBERT [22] and TOD-BERT [28], and (ii) the local context window (i.e., the preceding and next utterances) around the current utterance.

3.3 Efficiently learning locally dialogue-aware representations

ellodar increases training efficiency by using only the previous and next utterances as context (yet considering both directions), based on the observation that utterances in task-oriented dialogues surrounded by similar context windows often have the same conversational goals. Additionally, ellodar does not train an encoder from scratch, as that would require significant computational effort, whereas we envision a competitive yet computationally efficient method. Rather, ellodar exploits the rich semantics captured in the embeddings produced by pretrained transformer-based sentence encoders.

3.3.1 Model description

ellodar combines two distinct strategies. In each strategy, a linear transformation is learned to transform an utterance x, as first encoded by a frozen pretrained sentence encoder \(\phi (x)\), to a context-aware representation \(f(\phi (x))\). We train different such transformations respectively for user and system representations (\(f^\textsc {u}\) resp. \(f^\textsc {s}\)). The first strategy is designed to learn representations that are similar for utterances that (can) appear in the same context of preceding and following utterances. In practice, we only consider adjacent utterances as the context window, and the linear maps are learned by extrapolating the considered utterance x’s representation \(\phi (x)\) onto those of the adjacent utterances.

More formally, the representation \(f_{\textsc {ext},i}^{*}\in \mathbb {R}^{2h}\) for utterance \(x_i\) is obtained from the pretrained encoder representation \(\phi (x_i)\in \mathbb {R}^{h}\) (with the superscript \(*\in \{\textsc {u}, \textsc {s}\}\) indicating the system or user), as

$$ f_{\textsc {ext},i}^{*}~{\triangleq } f_{\textsc {ext}}^{*}\left( \phi \left( x_i\right) \right) = W^*_{\textsc {ext}}\,\phi (x_i) + b^*_{\textsc {ext}} $$

The parameters \(W^*_{\textsc {ext}} \in \mathbb {R}^{2h \times h}\) and \(b^*_{\textsc {ext}} \in \mathbb {R}^{2h}\) are trained by minimizing a vector similarity loss \(\mathcal {L}^*_{\textsc {ext}, i}\), i.e., ordinary least squares (\({{\,\textrm{OLS}\,}}\)):

$$ \mathcal {L}^*_{\textsc {ext},i} = {{\,\textrm{OLS}\,}}{\big (f_{\textsc {ext},i}^{*},\, \phi (x_{i-1})\oplus \phi (x_{i+1})\big )} $$

with \(\oplus \) denoting concatenation. This is illustrated by the right-hand part of Fig. 2.

Fig. 2  Training strategies of ellodar, where \(x_i\) are dialogue utterances, and Encoder is a pretrained sentence encoder. Left: \(f^*_{\textsc {int}}\), which predicts the embedding for the current utterance from the local context embedding (i.e., the preceding and following utterances in a dialogue). Right: \(f^*_{\textsc {ext}}\), which predicts the context embedding from the current one

The second strategy interpolates the current user (system) embedding from the adjacent system (user) context embeddings, reflecting the assumption that context windows enclosing similar utterances should be represented close to each other in the utterance representation space. The corresponding representation \(f_{\textsc {int},i}^{*}~\in \mathbb {R}^{h}\) for utterance \(x_i\) is constructed from the pretrained encoder representations \(\phi (x_{i-1})\) and \(\phi (x_{i+1})\) of its adjacent utterances as

$$ f_{\textsc {int},i}^{*}~\triangleq f_{\textsc {int}}^{*} \big (\phi (x_{i})\big ) = W^*_{\textsc {int}}\,\big (\phi (x_{i-1})\oplus \phi (x_{i+1})\big ) + b^*_{\textsc {int}} $$

with \(W^*_{\textsc {int}} \in \mathbb {R}^{h \times 2h}\) and \(b_{\textsc {int}}^* \in \mathbb {R}^h\). The corresponding loss is given by:

$$ \mathcal {L}^*_{\textsc {int}, i} = {{\,\textrm{OLS}\,}}{\big (f_{\textsc {int},i}^{*}\ ,\, \phi (x_{i})\big )} $$

A visual summary is given on the left part of Fig. 2. During training, the introduced loss terms are calculated and minimized over all utterances across all dialogues. After training, cellodar clusters the user utterances \(x_i^{\textsc {u}}\) represented as \(f_{\textsc {ext},i}^{\textsc {u}}\), \(f_{\textsc {int},i}^{\textsc {u}}\) or \(f_{\textsc {ext},i}^{\textsc {u}} \oplus f_{\textsc {int},i}^{\textsc {u}}\) (similarly for the system utterances).
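To make the two strategies concrete, the following is a minimal sketch of how ellodar’s user-side transformations could be fitted and clustered, assuming the frozen encoder is loaded via the sentence-transformers library and OLS and k-means come from scikit-learn; `user_triples` (one (previous, current, next) utterance triple per user utterance, gathered over all dialogues) and `n_user_states` are assumed to be prepared beforehand, and all names are illustrative rather than the actual implementation:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # frozen pretrained encoder phi
from sklearn.linear_model import LinearRegression      # ordinary least squares
from sklearn.cluster import KMeans

encoder = SentenceTransformer("all-MiniLM-L6-v2")       # assumption: MiniLM as phi

prev_emb = encoder.encode([p for p, _, _ in user_triples])   # phi(x_{i-1})
curr_emb = encoder.encode([c for _, c, _ in user_triples])   # phi(x_i)
next_emb = encoder.encode([n for _, _, n in user_triples])   # phi(x_{i+1})
context = np.concatenate([prev_emb, next_emb], axis=1)       # phi(x_{i-1}) concat phi(x_{i+1})

# ext: extrapolate the current utterance embedding onto its context window (OLS).
f_ext = LinearRegression().fit(curr_emb, context)
# int: interpolate the current utterance embedding from its context window (OLS).
f_int = LinearRegression().fit(context, curr_emb)

# cellodar: cluster the learned user representations into |V^u| states.
reps = np.concatenate([f_ext.predict(curr_emb),              # f_ext,i
                       f_int.predict(context)], axis=1)      # f_int,i
user_states = KMeans(n_clusters=n_user_states, n_init=10).fit_predict(reps)
```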

3.3.2 Background

ellodar draws inspiration from the CBOW and skip-gram models [23] for learning word vectors, and especially \(f_{\textsc {ext},i}^{*}\) bears similarities to the skip-thought model [20] for learning general purpose sentence embeddings. However, skip-thought employs two separate decoders to generate the preceding and following sentences, which (i) necessitates substantial computational efforts, (ii) produces sub-optimal sentence embeddings [21, 22] and (iii) requires hyperparameter tuning. In contrast, to specifically obtain dialogue-aware representations for clustering, ellodar exploits pretrained sentence encoders (i) by efficiently learning linear transformations entirely in their encoding space (on CPU) with a vector-to-vector optimization objective, and thus (ii) directly optimizes the embeddings to capture the dialogue context necessary for DSI (rather than using an indirect decoding objective), and (iii) does not require hyperparameter tuning.

Other methods, like DialoGPT [54], PLATO [55], and TOD-BERT [28], pretrain encoders on task-oriented dialogues to produce utterance representations that can be used in various downstream tasks, including clustering. It is worth noting that ellodar differs in that it does not (pre)train an encoder from scratch, but rather works complementarily and out-of-the-box with such already pretrained encoders: ellodar adapts their representations to the task-specific dialogues by learning a linear transformation on top of them to specifically improve cluster performance. ellodar’s efficient linear vector-to-vector regression is thus only feasible because the pretrained encoders have already absorbed the bulk of the computational effort.

4 Experimental setup

We describe the datasets, and how they were adapted for DSI, in Section 4.1. In Section 4.2, we motivate our choices of three different types of pretrained sentence encoders that were used to train ellodar, and discuss the recent joint models and cluster baselines to which ellodar is compared in Section 4.3. We provide training details in Section 4.4, and extensively describe the evaluation methodology in Section 4.5.

4.1 Datasets

We follow prior works in unsupervised DSI [16, 17, 37] and conduct experiments on task-oriented dialogues that span 10 domains across four commonly used conversational datasets: DSTC2 [24], CamRest676 [25, 26], SimDial [27] and the Schema Guided Dialogue (SGD) dataset [5]. Our experiments comprise a broader range of datasets compared to prior works: the DVRNN model of [16] was benchmarked on SimDial and CamRest676, the SVRNN model of [17] solely on SimDial, and the model of [37] on SGD, CamRest676, and DSTC2. Our experiments cover all four datasets, thus making ours, to the best of our knowledge, the most comprehensive benchmark to date. SimDial contains synthetic dialogues that were generated using a pre-defined probabilistic grammar. The DSTC2 and SGD datasets consist of human-machine dialogues, whereas the human-human dialogues in CamRest676 were obtained with the Wizard-of-Oz methodology [56].

In the aforementioned datasets, utterances are annotated with intents, acts and slots. We discard slot values and only consider their types, since we map utterances, rather than slots, to dialogue states and because a single type can have potentially many values, which would make the number of dialogue states intractable. Moreover, utterances may have multiple annotations, in which case we combine them. For example, the utterance “I want to find a comedy movie. Search for movies now showing in Oakland” with intent find-movies, act inform, and slot types genre and location, becomes [find-movies, inform.genre, inform.location] (ignoring the respective values comedy and Oakland). Thus, we obtain exactly one label for each utterance, allowing us to compare the induced dialogue states against the gold utterance labels with external cluster metrics, as will be discussed in Section 4.5. The gold numbers of user and system dialogue states, \(|V^{\textsc {u}} |\) and \(|V^{\textsc {s}} |\), are set to the number of unique user and system utterance labels, respectively. The statistics of the various domains and datasets are shown in Table 1 and samples of dialogues are given in Tables 14–16. We release our modified datasets such that they can be adopted in future works.
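As a small sketch of this label construction (the function and field names are hypothetical, not part of the released code):

```python
def utterance_label(intents, act_to_slot_types):
    """Combine intent, act, and slot-type annotations into a single label,
    discarding slot values (e.g., 'comedy' and 'Oakland' in the example above)."""
    parts = list(intents)
    for act, slot_types in act_to_slot_types.items():
        parts.extend(f"{act}.{slot}" for slot in slot_types)
    return "[" + ", ".join(parts) + "]"

label = utterance_label(["find-movies"], {"inform": ["genre", "location"]})
# label == "[find-movies, inform.genre, inform.location]"
```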

4.2 Pretrained sentence encoders

As discussed in Section 3.3.2, rather than pretraining an encoder from scratch, ellodar uses such already pretrained encoders out-of-the-box to produce utterance representations specifically for clustering. Since ellodar is thus agnostic to the sentence encoder, it can in principle be used with any such encoder \(\phi \). For our experiments, we used three different types of models described below.

  • MiniLM-L6 [22]: a general purpose sentence encoder that produces 384-dimensional vectors, offering a good trade-off between encoding speed and quality.

  • TOD-BERT-jnt [28]: a BERT-base model, yielding 768-dimensional embeddings, pretrained with a next-sentence prediction and contrastive objective on 9 task-oriented datasets that include CamRest676 and all domains in SGD. It was pretrained to encode utterances within dialogues, so that these encodings could be used in a variety of task-oriented downstream tasks. Note that while we chose TOD-BERT, other choices of task-oriented encoders such as, e.g., DialoGPT [54] and PLATO [55] are also possible.

  • GloVe [52]: utterances are represented as bag-of-words, i.e., their word-averaged GloVe embeddings. It is used as an ablation for the DVRNN and SVRNN models (see below) whose sentence encoders are initialized with GloVe, and as a baseline for the sentence encoders MiniLM and TOD-BERT.

4.3 Baselines

We aim to induce non-deterministic dialogue structures in task-oriented domains, as mentioned in Section 2.2. This same task is also considered by the joint DVRNN and SVRNN models, hence we use different configurations of these models as baselines for our cellodar approach. In addition, we compare cellodar to the cluster baselines of [18, 19] based on the used sentence encoders without ellodar training. Specifically, the baselines we will benchmark our own approaches against are:

  • DVRNN [16]: a discrete and recurrent extension of the Variational Auto-Encoder that learns to reconstruct the current turn from its discrete latent states and the preceding dialogue context. Turns are clustered into the discrete states.

  • SVRNN [17]: shares the same architecture as DVRNN but extends it with a structured attention mechanism over its hidden states.

  • Cluster baselines: utterances are clustered by using as input features their context window embeddings, represented as the concatenation of the embeddings of the utterances in the window. The utterance embeddings are obtained using the encoders of Section 4.2, and we consider as context windows (i) only the current utterance (indicated as c), as in prior works [16,17,18,19], (ii) the previous and current utterances (pc), and (iii) the full context window of previous, current, and next utterances (pcn).

Table 1 Dataset statistics

In contrast to existing works [16, 17] that compare only with cluster baseline (i), we additionally benchmark against (ii) and (iii), which serve as stronger baselines as they use additional dialogue context, similar to our cellodar approach.
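For clarity, the three context-window variants of the cluster baselines amount to simple feature concatenation before k-means; a sketch, reusing the (assumed) embedding matrices and names from the ellodar sketch in Section 3.3.1:

```python
# Cluster-baseline features (no learned transformation): concatenate the frozen
# encoder embeddings of the utterances in the chosen context window.
features = {
    "c":   curr_emb,                                                # current only, as in [18, 19]
    "pc":  np.concatenate([prev_emb, curr_emb], axis=1),            # previous + current
    "pcn": np.concatenate([prev_emb, curr_emb, next_emb], axis=1),  # previous + current + next
}
baseline_states = {name: KMeans(n_clusters=n_user_states, n_init=10).fit_predict(feats)
                   for name, feats in features.items()}
```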

4.4 Training details

As explained in Section 3.3, we use ordinary least squares to estimate the weights of the linear regression functions of ellodar (to obtain the representations \(f_{\textsc {ext}}^{\textsc {u}}\), \(f_{\textsc {int}}^{\textsc {u}}\), \(f_{\textsc {ext}}^{\textsc {s}}\), and \(f_{\textsc {int}}^{\textsc {s}}\)) and thus do not require hyperparameter tuning. For both cellodar and the cluster baselines, we use k-means to separately cluster the user utterances \(x^{\textsc {u}}\) and the system utterances \(x^{\textsc {s}}\) into respectively the gold number of user and system dialogue states, \(|V^{\textsc {u}} |\) and \(|V^{\textsc {s}} |\). We use 10 random seeds to initialize k-means and report the average scores over these 10 runs for CamRest676 and DSTC2. For SimDial, the results presented in the main body are further averaged over the 4 domains Weather, Bus, Restaurant, and Movies. Similarly, for SGD, we further average over the 4 domains Events, Homes, Music, and Movies. Scores for the individual domains, as well as the mean (± standard deviation) over the domains, are given in Appendix A.

4.5 Evaluation

Shi et al. [16] opted for a qualitative evaluation in which humans rated induced conversational graphs. Qiu et al. [17] presented two automatic metrics to quantitatively assess the quality of such graphs: Structure Euclidean Distance and Structure Cross-Entropy, which both estimate a probabilistic mapping between the induced and the gold states. However, the authors later deemed them unstable because of their high variance and instead recommended employing external cluster metrics for evaluating induced conversational graphs based on slot clusters [42].

In Section 4.1, we described how to obtain labels for utterance-based DSI, enabling us to also adopt such metrics, in particular: (i) the adjusted rand index (ARI) [57], (ii) the adjusted mutual information (AMI) [58] and (iii) the Fowlkes-Mallows score (FM) [59]. ARI and AMI extend respectively the rand index [60] and the mutual information to adjust for chance: random clusters obtain a score of 0.0 whereas perfect ones obtain 1.0. The rand index measures, out of all pairs of samples, the percentage of correct ones. A pair is correct when either (i) both samples have the same gold label and they are assigned to the same cluster, or (ii) both samples have a different gold label and they are mapped to a different cluster. Mutual information, on the other hand, relates to purity and assigns a high score to clusters if the majority of their samples have the same label.
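All three metrics have standard implementations, e.g., in scikit-learn; a minimal sketch, where `gold_labels` and `induced_states` are illustrative names for the per-utterance gold labels of Section 4.1 and the induced cluster identifiers:

```python
from sklearn.metrics import (adjusted_rand_score,
                             adjusted_mutual_info_score,
                             fowlkes_mallows_score)

ari = adjusted_rand_score(gold_labels, induced_states)        # ARI
ami = adjusted_mutual_info_score(gold_labels, induced_states)  # AMI
fm = fowlkes_mallows_score(gold_labels, induced_states)        # FM
```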

DVRNN and SVRNN cluster turns \((x_i^{\textsc {u}}, x_{i+1}^{\textsc {s}})\) of consecutive (user, system) utterances into turn states \(v^{\textsc {turn}}\in V^{\textsc {turn}}~\subseteq V^{\textsc {u}} \times V^{\textsc {s}}\) (or vice versa). The number of turn states \(|V^{\textsc {turn}} |\) corresponds to the number of unique turn labels, i.e., the combinations of labels of the turns’ utterances. Turn clustering becomes challenging when states contain few utterances because the turn states will become even sparser, e.g., for DSTC2 in Table 1: \(|V^{\textsc {turn}} |= 756 \gg |V^{\textsc {u}} |+ |V^{\textsc {s}} |\). To allow for a fair comparison with DVRNN and SVRNN, we report turn state cluster results on SGD and SimDial for cellodar and the cluster baselines. These are automatically inferred by combining the separately induced cluster identifiers of the system and user utterances that comprise a turn. In addition, we report utterance-based results for cellodar and the cluster baselines on CamRest676 and DSTC2. Note that CamRest676 lacks annotations for many system utterances, and the gold turn-based states for DSTC2 become very sparse (\(|V^{\textsc {turn}} |= 756\)). Therefore, obtaining turn-based results for CamRest676 and DSTC2 appeared not feasible, preventing the comparison of our models with DVRNN and SVRNN on these datasets (see Table 3).

5 Results

As the joint models DVRNN and SVRNN are initialized with GloVe embeddings, we first report results for the cluster baselines and cellodar also based on GloVe, thereby eliminating any advantage attributable to the use of pretrained transformers in our most competitive models. Table 2 shows that the cluster baselines outperform the joint models in almost all cases. Only for SimDial does the bag-of-words model of [18, 19] (GloVe\(_{\textsc {c}}\)) perform worse in terms of ARI and FM. Most notably, the strongest baseline (pcn) surpasses SVRNN on SimDial (SGD) by +49.4 (+10) percentage points in ARI, +34.9 (+27.3) in AMI, and +60.8 (+7.6) in FM. Moreover, the best cellodar model consistently outperforms the best cluster baseline, with further improvements on SimDial (SGD) of +23.3 (+3.8) in ARI, +13.1 (+1.2) in AMI, and +19.3 (+4.1) in FM. The key observations from this GloVe-based comparison are: (i) all cluster baselines, except the bag-of-words model of [18, 19] (GloVe\(_{\textsc {c}}\)), outperform the joint models, and (ii) the best cellodar model consistently outperforms the best cluster baseline.

Table 2 Main results with GloVe

Table 3 shows that these observations also hold for the sentence encoders MiniLM and TOD-BERT, with cellodar and the cluster baselines outperforming their counterparts based on the bag-of-words GloVe encoder. Note that the models of [18, 19] (subscripted by c) with MiniLM and TOD-BERT consistently outperform the joint models, which was not always the case for GloVe.

Table 4 reports the training times of cellodar and the joint DVRNN and SVRNN models. First, we discuss the computational resources required to train each model. DVRNN and SVRNN are built on the same code base. We adopt the hyperparameters from [17] with: (i) dropout set to 0.5, (ii) Adam as optimizer, (iii) a learning rate of 0.001, and (iv) 60 epochs. DVRNN is trained on a single GTX 1080 Ti GPU, using 40 dialogues per batch for all SimDial and all SGD domains. SVRNN uses a single Tesla V100 with batch size 40 for all SimDial domains and size 10 for all SGD domains (we could not fit more in memory). In contrast, cellodar uses a single 2.6 GHz Intel Core i7 to first learn its representations with ordinary least squares, and then cluster them with k-means [61], with 1,000 as the maximum iterations, and its k centroids initialized by k-means++ [62]. On a Tesla V100 GPU, MiniLM and TOD-BERT have encoding speeds of respectively 14,200 and 2,800 utterances/second. Encoding the largest considered dataset then takes respectively 1.84 and 9.34 seconds for MiniLM and TOD-BERT. When adding 9.34 seconds for the worst-case encoding speed to the average of 15.2 seconds to both learn and cluster representations, our slowest model, TOD-BERT\(_{\textsc {int+ext}}\), achieves a speedup of 89\(\times \,\) compared to DVRNN and 4,909\(\times \,\) compared to SVRNN, as shown in Table 4. Encoding sentences with MiniLM rather than with TOD-BERT results in further speedups, making it 279\(\times \,\) and 15,894\(\times \,\) faster than DVRNN and SVRNN, respectively.

Table 3 Main results with sentence encoders
Table 4 Training times

6 Discussion

In Section 6.1, we compare the recent joint models to the cluster-based methods, i.e., the cluster baselines and cellodar. Next, we compare the performance of ellodar’s two encoding strategies int and ext in Section 6.2. The effect of including local context by vector concatenation on the cluster baselines’ performance is analyzed in Section 6.3. We discuss the impact of using a bag-of-words, general purpose, or task-oriented sentence encoder on the cluster performance in Section 6.4. Then, in Section 6.5, we vary the gold number of dialogue states used as input for the clustering algorithm, to analyze cellodar’s effectiveness if that gold number of states is unknown. We compare the training time performance of the joint models to that of cellodar in Section 6.6, present ablation studies in Section 6.7 and a qualitative analysis of ellodar’s failure modes in Section 6.8, and conclude in Section 6.9 by discussing the limitations of this work.

6.1 Joint methods versus cluster-based approaches

Tables 2 and 3 show that the joint methods are consistently outperformed by the cluster baselines and cellodar. As also evidenced by the low AMI, ARI, and FM scores, we observed that the joint models frequently clustered utterances with different ground-truth labels into the same state. As the joint models are based on variational auto-encoders optimized with a next-turn decoding objective, we hypothesize that their poor performance is caused by posterior collapse [38,39,40]. The latter occurs when the model relies solely on the decoder’s auto-regressive properties rather than on the latent states to decode the next turn. That is, even if the joint models ignore the latent states entirely, they may still attain a small decoding loss. This explains why utterances with different ground-truth states are often incorrectly assigned to the same state. The cluster baselines and cellodar, on the other hand, do not rely on such decoding objectives, but instead induce dialogue states with k-means and thus directly exploit similarities between vector representations of utterances.

Moreover, the results in Section 5 demonstrate that the best cellodar models consistently outperform the best cluster baselines. Unlike ellodar, the cluster baselines do not learn to incorporate local dialogue context into utterance representations; instead, they simply concatenate representations. This indicates that learning how to include local context into representations is beneficial for DSI, and that ellodar’s learning schemes are successful at doing so. We consider ellodar the main technical contribution of this work.

6.2 Comparing ellodar’s encoding schemes

We note that the int and ext encodings of the utterance take different views: while ext aims to reconstruct a representation of the context from an utterance \(\phi (x)\) itself, int rather aims to reconstruct the utterance representation \(\phi (x)\) from the context. Given this complementary mechanism, we a priori expect their combination (int+ext) to perform best, while superiority of one over the other cannot be intuitively anticipated. The results in Tables 2 and 3 reveal that for SimDial all 3 encodings perform nearly perfectly, which prevents us from distinguishing their performance. Still, for both SGD and CamRest676, int+ext performs notably better than int, and slightly better than ext, thus confirming our a priori expectation. Somewhat surprisingly, on DSTC2, int+ext clearly performs worse than ext. We can, however, attribute this to the fact that DSTC2 comprises human-to-chatbot dialogues where the bot frequently misinterprets the user, thus leading to contexts that are sometimes disconnected from an enclosed utterance: as a result, (erroneous) context information from int is not as useful, as also reflected in the low int scores.

6.3 Impact of local context on the performance of the cluster baselines

We investigate the effect on structure quality of the two straightforward vector concatenation approaches for incorporating preceding (pc), and both preceding and subsequent (pcn) context. This contrasts with the model of [18, 19], which uses no context and was later adopted as a baseline in [16, 17]. Intuitively, we expect the cluster baselines that leverage the full context (pcn) to perform better than those using only the preceding (pc) or no context at all (c). Tables 2 and 3 reveal that on SimDial and SGD, the cluster metrics indeed consistently improve as the context window expands: c<pc<pcn. Conversely, the results for CamRest676 and DSTC2 get worse as the context window grows larger. The CamRest676 results indicate that naively including context does not always improve structure quality, emphasizing the benefits of using more advanced strategies like ext and int+ext. On DSTC2, the difference is even more apparent, with the cluster baselines of c clearly outperforming those of pcn, which is consistent with the previously discussed results of int and ext and thus attributed to the erroneous context from the human-to-chatbot dialogues. Still, we recommend adopting pc and pcn as baselines, since they significantly improve the average structure quality on all 4 SimDial and SGD domains.

6.4 Impact of the sentence encoder on the structure quality

First, as Tables 2 and 3 show, both the cluster baselines and the cellodar models based on bag-of-words representations (GloVe) perform consistently worse than their counterparts based on powerful sentence encoders (MiniLM and TOD-BERT), supporting our claim that transformer-based encoders are better for DSI.

Second, we investigate whether TOD-BERT, specifically trained to encode utterances in dialogues, outperforms the general purpose encoder MiniLM. The results in Table 3 are mixed. Since TOD-BERT was trained on all 16 SGD domains, including the 4 that we consider, we indeed find that on SGD, TOD-BERT models consistently outperform those based on MiniLM, notably for int+ext (improvements of +9.4, +2.4, and +12.4 for the ARI, AMI, and FM metrics respectively). Since TOD-BERT is also trained on all CamRest676 dialogues, it is surprising that MiniLM outperforms it. We hypothesize that this is due to the fact that the CamRest dialogues (i) comprise only 0.67% of the total dialogues used to train TOD-BERT (whereas the SGD dialogues account for 22.66%), and (ii) are dissimilar to those of SGD, such that little transfer occurs. Furthermore, the results on SimDial and DSTC2 (which were not used to train TOD-BERT) vary, with TOD-BERT outperforming MiniLM for some models but not for others, making it difficult to draw conclusions about the transferability to unseen domains.

In summary, the preliminary evidence on SGD suggests that it may be beneficial to pretrain sentence encoders specifically on the dialogues from which the structure is induced. The advantages of transferring to dialogues from unseen datasets (SimDial, DSTC2), however, remain unclear.

6.5 Overestimating the number of dialogue states

We assumed the gold numbers of \(|V^{\textsc {u}} |\) user and \(|V^{\textsc {s}} |\) system states to be known and used them to initialize k-means. In practice, \(|V^{\textsc {u}} |\) and \(|V^{\textsc {s}} |\) can be estimated by inspecting a subset of dialogues, but determining them exactly is challenging. Therefore, we investigate the effect of overestimating the number of states by initializing k-means with twice the gold number of user and system states: \(k=2\cdot |V^{\textsc {u}} |\) and \(k=2\cdot |V^{\textsc {s}} |\).

We present MiniLM results for the best cluster baseline (pcn) and the best cellodar model (int+ext), both with the overestimated number of clusters, and compare them to their counterparts, as well as the DVRNN and SVRNN models with the gold number of states.

First, Table 5 shows that the overestimated cluster baseline and int+ext still outperform DVRNN and SVRNN in all metrics and on all datasets, with notable improvements for MiniLM\(_{\textsc {pcn}}\) (MiniLM\(_{\textsc {int+ext}}\)) in AMI: +14.9 (+16.3) on SimDial, and +27.4 (+29.4) on SGD.

Table 5 Overestimating the number of dialogue states

Second, when comparing the overestimated models to their counterparts initialized with the gold number, we find that the overestimated models (i) drop in ARI and FM, and (ii) drop in AMI, but MiniLM\(_{\textsc {pcn}}\) (MiniLM\(_{\textsc {int+ext}}\)) still attain relatively high values of 76.2 (77.6) on SimDial and 52.7 (54.7) on SGD. Since the number of clusters increased twofold, utterances of the same gold state can be partitioned further into different clusters. Therefore, the decrease in ARI and FM is expected since these metrics penalize utterances of the same gold state if they are mapped to different clusters. AMI, on the other hand, measures cluster purity, with a high score indicating that most utterances in a cluster belong to the same gold state.

Thus, even when the number of clusters is overestimated by a factor of two, the cluster baseline and cellodar induce relatively pure clusters, with the latter outperforming the former, and both still considerably better than the DVRNN and SVRNN with the gold number of states.

6.6 Training time performance

In Section 5, we reported that our slowest cellodar model achieved a speedup of 89\(\times \) over DVRNN and 4,909\(\times \) over SVRNN. This efficiency gap can be attributed to the fact that the joint models are optimized with stochastic gradient descent (SGD), whereas cellodar is trained with more efficient learning schemes. Training neural networks with SGD requires multiple epochs of forward and backward passes through all training samples before converging to a local minimum, and thus, as per [16, 17], we used 60 epochs to train the joint models. Although cellodar relies on neural networks (MiniLM and TOD-BERT) to obtain sentence representations, encoding all training samples requires just a single forward pass. Similarly, ellodar’s linear transformations are cast as vector-to-vector regression and can thus be learned with ordinary least squares in a single pass. As the k-means algorithm of [61] has efficient implementations [63], clustering the ellodar representations is fast.

The training time difference between cellodar based on TOD-BERT and on MiniLM is twofold. With 3M parameters compared to 110M, MiniLM encodes sentences much faster than TOD-BERT. Additionally, MiniLM produces 384-dimensional vectors, while TOD-BERT produces vectors with twice the number of dimensions (i.e., 768). As k-means runtime depends on the number of input features, clustering MiniLM’s representations is thus faster than clustering TOD-BERT’s.

Table 6 Impact of bidirectional context on structure quality

6.7 Ablation study

We provide ablations to assess the impact of ellodar’s different components. First, we examine if training ellodar with bidirectional context, i.e., both preceding and subsequent dialogue context, improves structure quality compared to training ellodar with only preceding or subsequent context. Second, since ellodar uses a local context window (the preceding and subsequent utterances) for efficient representation learning, we explore whether training on larger context windows is useful.

Impact of bidirectional context on structure quality

To assess the impact of bidirectional context on cluster performance, we compare ellodar’s strategies: int, ext, and int+ext, trained with only the preceding (P) or next (N) utterance as context, against ellodar’s standard bidirectional (PN) context. For int, we transform the preceding (respectively next) utterance representation \(\phi (x_{i-1})\) (respectively, \(\phi (x_{i+1})\)) into the representation of the current utterance \(\phi (x_i)\). The training scheme and loss for ‘interpolating’ from the preceding utterance are:

$$ {f_{\textsc {int,p},i}^{*} \triangleq f_{\textsc {int,p}}^{*}\big (\phi (x_{i})\big ) = W^*_{\textsc {int,p}}\,\phi (x_{i-1}) + b^*_{\textsc {int,p}},} $$

and loss \({\mathcal {L}^*_{\textsc {int,p}, i} = {{\,\textrm{OLS}\,}}{\big (f_{\textsc {int,p},i}^{*} , \phi (x_{i})\big )}.}\)

Similarly, for ext with the preceding (next) utterance as context, we extrapolate from \(x_i\) to \(x_{i-1}\) (\(x_{i+1}\)) using:

$$ f_{\textsc {ext,p},i}^{*} \triangleq {f_{\textsc {ext,p}}^{*}\big (\phi (x_{i})\big ) = W^*_{\textsc {ext,p}}\,\phi (x_i) + b^*_{\textsc {ext,p}},} $$

and loss \({\mathcal {L}^*_{\textsc {ext,p},i} = {{\,\textrm{OLS}\,}}{\big (f_{\textsc {ext,p},i}^{*} , \phi (x_{i-1})\big )}.}\)

Note: the representation for int+ext with the preceding utterance as the only context is given by the concatenation \(f_{\textsc {ext,p}}^{*}\big (\phi (x_{i})\big ) \oplus f_{\textsc {int,p}}^{*}\big (\phi (x_{i})\big )\) (and similarly for the next utterance as context).

Except for ext on DSTC2, the results presented in Table 6 clearly underscore the importance of using bidirectional context for learning representations to induce dialogue structures: across all datasets and strategies, using both preceding and subsequent dialogue context (PN) consistently yields higher structure quality compared to using either preceding (P) or subsequent (N) context alone.

Impact of dialogue context width on structure quality

In the previous paragraph, we highlighted the importance of training ellodar with bidirectional context rather than with solely the preceding or subsequent dialogue context. However, it is worth noting that ellodar uses only the local dialogue context, comprising the preceding and subsequent utterances, to efficiently learn representations. Here, we investigate whether using larger (bidirectional) dialogue contexts can yield improved ellodar representations. To explore this, we compare the performance of ellodar’s strategies int, ext, and int+ext trained on larger dialogue contexts against training ellodar with the default local context window. We experiment with two context windows increasingly larger than ellodar’s default local dialogue context window of just one preceding and one subsequent utterance (PN):

  1. The dialogue context consisting of the concatenation of the representations of the 2 preceding and 2 subsequent utterances (shortly written as P\(_2\)N\(_2\)), for which we provide the training scheme and loss below for the int strategy:

    $$ f_{\textsc {int,p}_{2}\textsc {n}_{2},i}^{*} \triangleq f_{\textsc {int,p}_2\textsc {n}_2}^{*}\big (\phi (x_i)\big ) = W^*_{\textsc {int,p}_{2}\textsc {n}_{2}}\,\big (\phi (x_{i-2})\oplus \phi (x_{i-1})\oplus \phi (x_{i+1})\oplus \phi (x_{i+2})\big ) + b^*_{\textsc {int,p}_{2}\textsc {n}_2}, $$

    with as loss \(\mathcal {L}^{*}_{\textsc {int,p}_2\textsc {n}_2, i} = {{\,\textrm{OLS}\,}}{\big (f_{\textsc {int,p}_2\textsc {n}_2,i}^{*}, \phi (x_{i})\big )}\).

  2. The dialogue context consisting of the concatenation of the average of all preceding and the average of all subsequent utterance representations (P\(_{*}\)N\(_{*}\)). Note that we take the mean of all preceding and the mean of all subsequent utterance representations, rather than concatenating them all, to avoid high-dimensional representations that may prevent efficient clustering. We provide the training scheme and loss below for the int strategy:

    $$ f_{\textsc {int,p}_{*}\textsc {n}_{*},i}^{*} \triangleq f_{\textsc {int,p}_{*}\textsc {n}_{*}}^{*} \big (\phi (x_i)\big ) = W_{\textsc {int,p}_{*}\textsc {n}_{*}}^{*}\,\Big (\underset{j=0,\ldots , i-1}{\text {average}}\,\phi (x_j) \,\oplus \underset{k=i+1,\ldots ,N-1}{\text {average}}\,\phi (x_k)\Big ) + b_{\textsc {int,p}_{*}\textsc {n}_{*}}^{*}, $$

    with as loss \(\mathcal {L}^*_{\textsc {int,p}_{*}\textsc {n}_{*},i} = {{\,\textrm{OLS}\,}}{\big (f_{\textsc {int,p}_{*}\textsc {n}_{*},i}^{*}, \phi (x_{i})\big )}\).
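A rough sketch of how the P\(_{*}\)N\(_{*}\) context features could be assembled for one utterance position (assuming a per-dialogue matrix of frozen encoder embeddings; names are illustrative, and boundary utterances would need padding in practice):

```python
import numpy as np

def wide_context_features(dialogue_embs, i):
    """P*N* context for utterance i: mean of all preceding embeddings concatenated
    with the mean of all subsequent embeddings (assumes 0 < i < len(dialogue_embs) - 1)."""
    prev_mean = dialogue_embs[:i].mean(axis=0)       # average of phi(x_0), ..., phi(x_{i-1})
    next_mean = dialogue_embs[i + 1:].mean(axis=0)   # average of phi(x_{i+1}), ..., phi(x_{n-1})
    return np.concatenate([prev_mean, next_mean])
```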
Table 7 Impact of dialogue context width on structure quality

Table 7 reveals that, for SimDial, there is minimal difference in cluster performance among the various dialogue context sizes. However, across all other datasets (excluding DSTC2 and the int(+ext) strategy), the results indicate that the overall best structure quality is achieved when ellodar is trained with the local context window of just a single preceding and next utterance (PN). It consistently outperforms ellodar trained with the full context window P\(_{*}\)N\(_{*}\), and is either better than or on par with ellodar trained on P\(_{2}\)N\(_{2}\) as context. This observation is further supported by the fact that the larger the context window, the poorer the cluster performance: the cluster performance for P\(_{*}\)N\(_{*}\) is inferior to that of P\(_{2}\)N\(_{2}\), with the latter slightly underperforming compared to the local dialogue context window PN. These results confirm that using only the local dialogue context for learning ellodar’s representations is a good choice. However, it is worth noting that the capacity of ellodar’s linear vector-to-vector regression is limited. As a result, ellodar may be too constrained to effectively exploit the subtle signals in larger dialogue contexts. Nevertheless, the observation that ellodar can effectively exploit signals in the local dialogue context alone suggests that there is sufficient signal in this local context to induce dialogue structures. This is particularly noteworthy when compared to more complex variational-based models such as DVRNN and SVRNN, which rely on the entire preceding dialogue context, yet struggle to induce representative dialogue structures.

Table 8 Illustration of common failure modes in ellodar

6.8 Qualitative analysis

While ellodar’s representations can be efficiently learned, its efficiency primarily stems from its linear vector-to-vector regression objective. Yet, linear transformations may be too restrictive to handle complex edge cases, as there is a trade-off between efficiency and the complexity of cases ellodar can model. Hence, to better understand these limitations, we conduct a qualitative analysis of common failure modes of ellodar. We begin by identifying three failure modes, i.e., instances where ellodar’s utterance representations are incorrectly assigned to clusters, and provide examples of each. Next, we present the distribution of these three failure modes by manually categorizing a randomly selected subset of utterances that were erroneously assigned to clusters induced by cellodar into these failure modes.

Identification of common failure modes

To better understand ellodar’s shortcomings, we reveal and analyze common failure modes where utterances are incorrectly assigned to clusters due to ellodar’s learning approach, particularly due to its reliance on local context.

For this, we conduct a qualitative analysis of cellodar-induced clusters based on human annotation. By manually categorizing incorrectly assigned utterances, along with their respective previous and subsequent utterances, we can reveal the most visible failure modes inherent to ellodar’s learning scheme. First, to identify incorrectly assigned utterances, we use the following heuristic: for each cellodar-induced cluster \(\mathcal {C}\), we assign a gold label to \(\mathcal {C}\) which is the most prevalent gold label \(y_{\mathcal {C},\textsc {gold}}\) among all utterances in \(\mathcal {C}\). An utterance \(x_i\) is then erroneously assigned to \(\mathcal {C}\) if its gold label \(y_i\) differs from the most frequently occurring gold label in \(\mathcal {C}\), i.e., \(y_i \ne y_{\mathcal {C},\textsc {gold}}\). Second, to categorize misassigned utterances into failure modes intrinsic to ellodar’s learning scheme, we manually compare each misassigned utterance and its local context window with those of correctly assigned utterances within the same cluster. We consider the following failure modes:

  (1) \(\overline{\text{P}}\text{C}\overline{\text{N}}\): the misassigned utterance \(x_i\) shares semantics with correctly assigned utterances \(x_{j\ne i}\) in cluster \(\mathcal {C}\). However, the preceding utterance \(x_{i-1}\) and subsequent utterance \(x_{i+1}\) differ from the dialogue context of the correctly assigned utterances, i.e., \(x_{i+1} \not \approx x_{j+1}\) and \(x_{i-1} \not \approx x_{j-1}\). The example in the upper row of Table 8 illustrates this, where the misassigned utterance “Yes, that sounds great” is equivalent to the correctly assigned “Sounds great”. Yet, their dialogue states (\(y_j\): affirm for the correctly assigned \(x_j\); \(y_i\): select for the misassigned \(x_i\)) differ due to variations in the semantics of both the preceding and subsequent utterances. This mode is intrinsic to ellodar, where: (i) for \(f_{\textsc {int},i}^{*}\), two distinct context windows transform into the same utterance representation, i.e., \(\phi (\)“Sounds great”) \(\approx \) \(\phi (\)“Yes, that sounds great”), and (ii) for \(f_{\textsc {ext},i}^{*}\), the same input utterance representation \(\phi (\)“Sounds great”) \(\approx \) \(\phi (\)“Yes, that sounds great”) may transform into the context representation most frequently associated with this input (e.g., that of the correctly assigned utterances).

  (2) \(\text{P}\overline{\text{C}}\text{N}\): the misassigned utterance \(x_i\) lacks shared semantics with the correctly assigned utterances \(x_{j \ne i}\) in \(\mathcal {C}\). However, both the preceding and subsequent utterances share semantics among the misassigned and correctly assigned utterances, i.e., \(x_{i+1} \approx x_{j+1}\) and \(x_{i-1} \approx x_{j-1}\). In the middle part of Table 8, \(x_i\) is more specific than \(x_j\), as it not only informs about the number of beds but also mentions allowing pets. Note that the reverse, where \(x_j\) is more specific than \(x_i\), can also occur. This mode is intrinsic to ellodar, where: (i) for \(f_{\textsc {int},i}^{*}\), two equivalent context representations may transform into the utterance representation most frequently surrounded by that context, i.e., \(\phi (x_j)\), and (ii) for \(f_{\textsc {ext},i}^{*}\), semantically different input utterances transform into the same context representation.

  (3) \(\overline{\text{P}}\overline{\text{C}}\text{N}\): the only shared semantics among the correctly assigned utterances \(x_{j\ne i}\) and the misassigned utterance \(x_i\) are those of the subsequent utterances, i.e., \(x_{i+1} \approx x_{j+1}\). As illustrated in the bottom part of Table 8, akin to the example for \(\text{P}\overline{\text{C}}\text{N}\), the semantics of \(x_i\) and \(x_j\) are similar, but \(x_j\) is more specific as it also requests the event name aside from the city. This mode is intrinsic to ellodar for similar reasons as the \(\text{P}\overline{\text{C}}\text{N}\) mode, with the difference that the subsequent utterance, whose semantics are shared among \(x_i\) and \(x_j\), has a larger effect on the final representation than the preceding utterance, which does not share the same semantics.

Note that our list of three failure modes is non-exhaustive: other combinations of shared and non-shared semantics among the preceding utterance, the considered utterance, and the subsequent utterance may also occur. However, as we did not encounter such failure modes in our randomly selected subset of 120 erroneously assigned utterances (as described below), we do not include them here. Aside from the presented failure modes inherent to ellodar, there are also failure modes not related to ellodar but inherent to k-means clustering itself, such as outliers. For the instances where utterances cannot be categorized into one of the three presented failure modes, we include an extra “other” mode.
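The misassignment heuristic described above can be summarized in a few lines. The sketch below is illustrative (variable names are assumptions) and operates on per-utterance induced cluster ids and gold dialogue-state labels.

```python
# Minimal sketch of the misassignment heuristic: an utterance is flagged as
# misassigned when its gold dialogue state differs from the most prevalent
# gold state within its induced cluster.
from collections import Counter

def find_misassigned(cluster_ids, gold_labels):
    """cluster_ids[i]: induced cluster of utterance i; gold_labels[i]: its gold state."""
    per_cluster = {}
    for c, y in zip(cluster_ids, gold_labels):
        per_cluster.setdefault(c, Counter())[y] += 1
    # Majority gold label per induced cluster.
    majority = {c: counts.most_common(1)[0][0] for c, counts in per_cluster.items()}
    return [i for i, (c, y) in enumerate(zip(cluster_ids, gold_labels))
            if y != majority[c]]
```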

The distribution of common failure modes

To better understand the frequency with which each of the three identified failure modes occurs, we randomly sampled and manually annotated 20 (10 user and 10 system) misassigned utterances of cellodar-induced clusters for each SGD domain (Events, Homes, Music, and Movies), CamRest676, and DSTC2, for a total of 120 utterances. Note that SimDial is excluded from this analysis, as cellodar almost perfectly recovers its underlying gold structures.

Table 9 shows the distribution of failure modes in the SGD, CamRest676, and DSTC2 datasets. Overall, \(\text{P}\overline{\text{C}}\text{N}\) is the most frequently occurring failure mode, with the other types of errors occurring less frequently. The results for \(\text{P}\overline{\text{C}}\text{N}\) suggest that ellodar struggles with the edge case in which the surrounding local dialogue context is shared between two utterances that are semantically different (i.e., with different underlying gold dialogue states). ellodar cannot resolve this failure case well due to its sole reliance on local context and linear transformations. Future work could therefore explore trading off some efficiency for additional modeling capacity, e.g., by using non-linear transformations and/or by more effectively exploiting subtler cues in larger dialogue contexts.

Table 9 The distribution of common failure modes

6.9 Limitations

Application domain

First, ellodar is designed specifically for clustering dialogue utterances: it uses the context of both preceding and subsequent utterances to produce contextual representations by learning linear transformations on top of a frozen pretrained sentence encoder. This means it cannot be directly applied to task-oriented downstream tasks that only have access to the preceding dialogue, such as intent classification and response generation.

Second, our work focuses on inducing dialogue structures at the utterance level (assigning utterances to states) and thus cannot be straightforwardly applied to the task of recovering dialogue structures based on slot type induction (assigning words or subphrases to states) as in [42].

Third, our specific focus was on extracting dialogue structures from task-oriented dialogues, which typically involve two parties exchanging utterances in alternation. Therefore, we did not conduct experiments on dialogues with multiple consecutive user or system utterances, nor on multi-party dialogues (where more than two actors can appear in a single dialogue).

Finally, our work focuses on inducing dialogue structures from text only. However, in order to better recover structures, an interesting and unexplored direction for future work would be to consider a multi-modal setting where dialogues are augmented with other modalities, such as images.

Reliance on the ground truth number of dialogue states

The main presented results rely on initializing the number of clusters of all considered models with the ground truth number of dialogue states. In practice, however, the ground truth number of states is unknown and would thus need to be estimated, e.g., by inspecting a subset of the available dialogues. To assess the impact of not setting the correct number of states, Section 6.5 analyzes the effect of overestimating the ground truth number of states by a factor of two, demonstrating that our proposed methods induce relatively pure clusters and still outperform both joint methods. An interesting direction for future work would thus be to investigate clustering algorithms that do not require the number of dialogue states as input, such as DBSCAN [64], mean shift [65], and affinity propagation [66].
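As a pointer towards this direction, the sketch below applies such algorithms from scikit-learn to pre-computed utterance representations; the placeholder data and hyperparameter values are assumptions and would require tuning in practice.

```python
# Minimal sketch: clustering algorithms that infer the number of clusters
# (dialogue states) from the data instead of requiring it as input.
import numpy as np
from sklearn.cluster import DBSCAN, MeanShift, AffinityPropagation

# Placeholder for real utterance representations (e.g., ellodar vectors).
utterance_vectors = np.random.randn(200, 384)

labels_dbscan = DBSCAN(eps=0.5, min_samples=5).fit_predict(utterance_vectors)
labels_meanshift = MeanShift().fit_predict(utterance_vectors)
labels_affinity = AffinityPropagation(random_state=0).fit_predict(utterance_vectors)

# Number of induced states per algorithm (DBSCAN marks noise points as -1).
for name, labels in [("DBSCAN", labels_dbscan),
                     ("MeanShift", labels_meanshift),
                     ("AffinityPropagation", labels_affinity)]:
    n_states = len(set(labels)) - (1 if -1 in labels else 0)
    print(name, n_states)
```

The trade-off is that these algorithms introduce their own hyperparameters (e.g., DBSCAN’s eps), whose tuning would replace the search over the number of states.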

Dialogue context representation strategy

To include local context, we clustered the concatenation of the considered utterance’s representation and its adjacent utterances’ representations, rather than leveraging more advanced techniques that integrate different views of the data, such as multi-view k-means [67]. We leave the latter for future work.
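For concreteness, the following minimal sketch (with placeholder data; names and the zero-padding at dialogue boundaries are assumptions) illustrates this concatenation strategy.

```python
# Minimal sketch of the concatenation strategy: each utterance is represented by
# [previous; current; next] embeddings before k-means clustering.
import numpy as np
from sklearn.cluster import KMeans

def concat_local_context(dialogue_embs):
    """dialogue_embs: (num_utterances, dim) embeddings of a single dialogue."""
    pad = np.zeros_like(dialogue_embs[:1])           # zero vector at the boundaries
    padded = np.vstack([pad, dialogue_embs, pad])
    prev_, curr, next_ = padded[:-2], padded[1:-1], padded[2:]
    return np.concatenate([prev_, curr, next_], axis=1)

# Placeholder for per-dialogue matrices of frozen-encoder (or ellodar) vectors.
all_dialogues = [np.random.randn(8, 384) for _ in range(3)]
num_states = 5

features = np.vstack([concat_local_context(d) for d in all_dialogues])
state_ids = KMeans(n_clusters=num_states, n_init=10).fit_predict(features)
```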

Training time performance analysis

The training time analysis in Section 6.6 compared the training times of state-of-the-art joint models to those of our approaches. Because training time is affected by factors such as implementation and batch size, the reported times should be interpreted as indicative rather than as exact numbers.

Pre-training sentence encoders

Because our work focuses on computational efficiency, we did not further experiment with specifically pretraining sentence encoders on each distinct domain or dataset. However, the preliminary results of TOD-BERT on SGD, discussed in Section 6.4, suggest that such specific pretraining might be beneficial. It is further worth noting that the effectiveness and efficiency of ellodar’s linear vector-to-vector regression is in part attributable to ellodar building upon an out-of-the-box pretrained sentence encoder that has already undergone substantial computational effort: training such encoders from scratch is computationally expensive.

Generalizability to additional human-human dialogues

While our experiments cover a broader range of datasets than prior DSI works, i.e., four commonly used conversational datasets (DSTC2, CamRest676, SimDial, and SGD), it is worth noting that SimDial comprises synthetic dialogues, SGD and DSTC2 contain human-machine dialogues, and CamRest676 contains human-human dialogues. As such, there remains uncertainty about the generalizability of cellodar to human-human dialogues other than CamRest676. Unfortunately, due to the lack of utterance-level annotated conversational datasets (as opposed to slot extraction datasets, e.g., MultiWOZ [51]), we were unable to cover additional datasets, and defer exploring this to future work.

7 Implications of the presented research results

The findings in this work have implications for the relatively underexplored DSI domain. Our main goal was to design an efficient DSI model, which we argued to be essential in practical settings, e.g., when users need to run the DSI model multiple times with different numbers of dialogue states to recover the optimal structure. By revisiting and further developing the cluster-based method of [18, 19], we demonstrated that simple DSI models can be orders of magnitude faster than, yet still outperform, more complex existing models. We therefore want to emphasize that pragmatic architectural choices, rather than the prevailing trend of seeking performance gains through ever more complex (neural) models, can yield improvements in both efficiency and effectiveness. We hope that this will encourage the community to pursue model efficiency as an important design aspect, besides model effectiveness.

Second, as no publicly available framework for benchmarking DSI models currently exists, we release our modified datasets and evaluation setup to accelerate future DSI research, and we hope that our simple cellodar approach will serve as a strong baseline.

8 Conclusions

Unlike recently proposed DSI models that jointly learn to encode and cluster utterances, we revisited an efficient cluster-based approach that proceeds in two steps. It first encodes utterances as vectors, after which it clusters the obtained representations to induce the dialogue structure in the second step. However, the previously proposed cluster-based approach encodes utterances as bag-of-words or skip-thought vectors without using dialogue context. Hence, we proposed to adopt more powerful transformer-based sentence encoders and contributed ellodar, a highly efficient approach for learning dialogue aware representations. ellodar trains linear transformations with a vector-to-vector regression objective in the encoding space of a frozen sentence encoder using a local context window. Extensive experiments on representative DSI datasets show that: (i) the cluster-based approach outperforms the recent joint models when using transformer-based encoders to represent utterances, (ii) clustering ellodar’s representations further improves performance consistently, while being orders of magnitude faster than the joint models. We release our datasets (which are variants of commonly adopted DSI datasets), evaluation, and models as a common benchmark for DSI, which is currently missing.