1 Introduction

Conversational modeling has recently attracted considerable attention, spanning tasks such as intent classification [1,2,3], dialogue state tracking [4,5,6], and slot filling [3, 7, 8]. Such systems have been used to automate customer service in the insurance and public transportation industries [9], to provide information in healthcare [10], and to perform tasks such as legal case retrieval [11] and conversational recommendation [12].

Fig. 1  Dialogue Structure Induction (DSI): the structure (right) is induced from the set of N user-system dialogues (left), with the distinct user and system states as nodes, and transition probabilities as edges. Illustration based on SimDial [27]

Designing conversational agents, however, requires readily available annotated data. While companies often have access to an abundance of unlabeled dialogues, such as those exchanged between their customers and support agents, annotating them to develop conversational agents remains costly. Consequently, dialogue structure induction (DSI) aims to recover, without supervision, the latent conversational structure from a set of task-oriented user-agent dialogues. Figure 1 shows an example of such a graph, in which the nodes represent the distinct user and agent (system) dialogue states, and the edges denote the possible transitions between successive states and their probabilities. A dialogue structure can compactly summarize an entire collection of dialogues, providing companies with relevant insights about their customers and agents, and thus offering a solid starting point for designing conversational models.

In unsupervised DSI, utterances with similar conversational goals are first clustered into the same dialogue state, and the structure (transition probabilities from one state to another) can then be recovered either directly from the model’s weights or by counting the number of transitions between successive states. Earlier work extended hidden Markov models to infer conversational graphs [13,14,15]. More recently, neural end-to-end models, e.g., DVRNN [16] and SVRNN [17], jointly learn to encode utterances and assign them to dialogue states. Yet such neural models (i) represent utterances by only considering the preceding dialogue context, and (ii) require GPUs and tend to be slow at inducing dialogue structures, as they are trained with a computationally expensive next-turn decoding objective. Since DSI models embed the number of dialogue states in their architecture, they must be re-trained every time that number changes. It is thus important to induce dialogue structures efficiently, since in practice, users may need to experiment with different numbers of states to recover the optimal structure.

To address the weaknesses above, our work revisits and further builds upon the method of Gunasekara et al. [18, 19], which comprises two efficient steps: utterances are first encoded into vectors and subsequently clustered. However, Gunasekara et al. [18, 19] represent utterances as bag-of-words or skip-thought [20] vectors, which have been shown to perform poorly in semantic similarity tasks [21, 22], and do so without considering dialogue context. In this work, we first demonstrate that encoding utterances with powerful transformer-based sentence encoders instead already leads to improvements over recent joint models in terms of cluster metrics, while being orders of magnitude faster at inducing the dialogue structure.

Next, we propose a highly efficient strategy to embed both preceding and subsequent dialogue context into utterance vector representations, called ellodar (for “Efficiently Learnt Locally Dialogue Aware Representations”), which further boosts performance in terms of cluster metrics. We cluster ellodar’s representations to induce dialogue structure, and refer to the complete procedure as cellodar. Regarding the aforementioned limitations of existing works, cellodar (i) uses both preceding and subsequent context, (ii) can be trained on CPU within seconds, and thus (iii) makes determining the number of dialogue states up to four orders of magnitude faster than recent joint models.

To obtain dialogue-aware embeddings before clustering, ellodar draws inspiration from the CBOW and skip-gram (i.e., word2vec) models for learning word embeddings [23]: utterances with similar context windows, and context windows enclosing similar utterances, are represented closer to each other in the embedding space. ellodar is efficient (it trains within seconds on CPU) as it learns a linear transformation with a vector-to-vector regression training objective in the encoding space of a frozen pretrained encoder, exploiting a local, yet bidirectional, context window. By casting representation learning as vector-to-vector regression, ellodar avoids the computational overhead incurred by decoding objectives, such as those used for training the joint DVRNN and SVRNN models.

Extensive experiments on 10 task-oriented domains spanning the DSTC2 [24], CamRest676 [25, 26], SimDial [27] and Schema Guided dialogue [5] datasets show that cellodar yields absolute improvements over recently proposed joint methods of 7%–74% in standard cluster metrics while being 10 to \(10^4\) times faster.

1.1 Research objective and contributions

Our objective is not to outperform existing approaches by merely developing increasingly more complex models. Rather, our goal is to attain state-of-the-art performance while being highly efficient compute-wise, thereby making it feasible to induce dialogue structures in practice. More specifically, we aim to obtain a model that (i) outperforms the more complex joint models, i.e., DVRNN [16] and SVRNN [17], as measured by standard cluster metrics, and (ii) is sufficiently lightweight for inducing dialogue structures on accessible and cheap computing resources such as CPUs (rather than requiring GPUs like such joint models).

We summarize our contributions as follows:

  1. We revisit the cluster baseline proposed in [18, 19], and demonstrate that clustering utterances encoded by transformer-based sentence encoders [22, 28], rather than by bag-of-words or skip-thought vectors, already outperforms the recent joint models for DSI [16, 17] in terms of inducing the correct dialogue structure, while being orders of magnitude faster.

  2. We contribute ellodar, a highly efficient utterance representation learning approach that exploits local dialogue context to train linear transformations in the encoding space of a frozen sentence encoder using a vector-to-vector regression training objective. Clustering the ellodar representations (referred to as cellodar) is shown to outperform — by a large margin — the joint DVRNN and SVRNN models [16, 17] (while being orders of magnitude faster) as well as the improved transformer-based cluster baselines, on representative DSI datasets.

  3. Since there exists no common benchmark for DSI, we release our modified datasets, evaluation, and models, which we hope will spur future research in the unexplored DSI task.

2 Related work

We summarize previous research on the relatively unexplored task of unsupervised dialogue structure induction. There are many variations of this task, including both supervised and unsupervised statistical methods that learn structures based on dialogue acts, as discussed in Section 2.1. However, we specifically focus on unsupervised dialogue structure induction for task-oriented dialogues, for which Section 2.2 reviews recent joint models based on neural and variational approaches, and compares them to our proposed approach. In addition, we discuss methods for structure learning based on unsupervised slot extraction in Section 2.3, which is a related but distinct task. Finally, Section 2.4 outlines the various applications for which dialogue structures have been used.

2.1 Unsupervised dialogue act modeling

Early work focused on structure modeling of dialogues based on categorizing utterances into high-level dialogue acts (e.g., question, statement, request, and acknowledgment) and then learning the structure (transitions) among these acts (states). In [29], utterances are manually annotated with dialogue acts, and the general discourse structure is then inferred using stochastic grammars. Since labeling dialogue text thus requires expensive annotation, the focus shifted to unsupervised dialogue act learning. Crook et al. [30] use Dirichlet Process Mixtures to cluster utterances into dialogue acts, but their approach does not model structural information that captures transitions between different acts. Therefore, to both model acts and learn the structure among them, Ritter et al. [14] combine hidden Markov and topic models to identify general discourse structure (i.e., dialogue acts) and dialogue-specific topics in non-task-oriented conversations. Joty et al. [31] further improve the approach of [14] by expanding the set of sentence features used to estimate the hidden Markov model’s act emission distribution to include the speaker, relative position, and sentence length in addition to unigrams. Similarly, the method of [32] uses hidden Markov models to model structural dependencies between dialogue acts, but instead estimates the act emission probabilities using Gaussian mixtures, enabling the use of real-valued sentence embeddings, such as bag-of-words GloVe vectors, to represent utterances, as opposed to discrete features [14, 31] such as unigrams and utterance length.

2.2 Unsupervised task-oriented dialogue structure induction

In contrast to the aforementioned works on identifying high-level dialogue acts, another line of work focuses on modeling dialogue structures in task-oriented domains, with the aim of categorizing utterances into more fine-grained, task-specific intents. Early approaches, such as those of [13, 15], adopt hidden Markov models (HMMs) to cluster text spans in task-oriented dialogues into states and learn the dependencies between them. Zhai et al. [15] follow an approach similar to the above-cited [14], but consider task-oriented dialogues, assuming that utterance words are generated from a mixture of topic models shared across all states rather than from a single model per state.

To better capture the highly non-linear dynamics in dialogues [33], recent solutions have shifted away from simple HMMs towards neural end-to-end models that jointly learn to encode and cluster utterances to induce task-oriented dialogue structures. Shi et al. [16] propose the use of Discrete Variational Recurrent Neural Networks (DVRNNs) to assign turns to discrete latent states, decoding the current turn from its predicted state and the preceding turns. Qiu et al. [17] extend DVRNNs to SVRNNs by adding structured attention [34] over its hidden states, enforcing a structural inductive bias that is more aligned with DSI. The work in [35] proposes a modification of the DVRNN model that separates user and system utterances instead of treating them jointly, leading to more accurate assignment of system actions to states. However, the approach of [35] relies on weak supervision from database queries performed by a human at some point in the dialogue, whereas the unsupervised DVRNN and SVRNN models do not require such (weak) supervision. Rather than inducing dialogue structures in task-oriented domains, Xu et al. [36] induce them in an open-domain setting, using a combination of discrete variational models with graph neural networks to hierarchically discover different domains and then learn the structure within each domain. To obtain more easily interpretable structures, Sun et al. [37] propose an Edge-Enhanced Graph Auto-Encoder that induces deterministic dialogue structures.

Our work focuses on unsupervised induction of non-deterministic dialogue structures in task-oriented domains, given that transitions between dialogue states are inherently probabilistic. We thus focus on the same task as the DVRNN [16] and SVRNN [17] models that jointly learn to encode and cluster utterances. However, both those models (i) only consider preceding dialogue context and, because they are based on Variational Auto-Encoders optimized with a next-turn decoding objective, they (ii) are slow to train, and (iii) are susceptible to posterior collapse [38,39,40]. Posterior collapse occurs when the model relies solely on the decoder’s auto-regressive properties to reconstruct inputs, thus bypassing the latent states altogether, which may result in utterances with distinct conversational goals being erroneously assigned to the same state.

To address these limitations (i)–(iii), our work builds on the method of [18, 19] that comprises two efficient steps: utterances are first (1) encoded as vectors and then (2) clustered into dialogue states (e.g., using k-means). Clustering assigns utterances to states based on vector similarities rather than on an indirect decoding objective. However, the methods used in [18, 19] for representing utterances as vectors, such as bag-of-words and skip-thought vectors, are sub-optimal for semantic similarity tasks [21, 22]. Furthermore, since these bag-of-words or skip-thought vectors are not fine-tuned on task-specific dialogues, the approach of [18, 19] does not utilize dialogue context. Here, we first experiment with using more powerful transformer-based encoders like SBERT [22] and TOD-BERT [28] that are better suited for semantic similarity tasks. Then, we propose ellodar as a method for obtaining task-specific contextual utterance representations by building upon an already pretrained transformer encoder, which is kept frozen, and subsequently learning a linear transformation on top of it with a vector-to-vector regression objective, using both preceding and subsequent context.

2.3 Unsupervised dialogue slot extraction

Similar to our current work, the methods discussed in Sections 2.1 and 2.2 induce dialogue structures by mapping utterances to states. In the related but different slot-based dialogue structure induction task, words or subphrases, rather than utterances, are mapped to states in task-oriented domains. To this end, Hudeček et al. [41] use weak supervision from rule-based parsers to identify potential slot candidates, which are then clustered into task-specific slots. Qiu et al. [42] employ transfer learning instead, using supervision from domains with available slot annotations to first train a model that detects slot boundaries. The obtained slot boundary detection model is then applied to unseen domains to identify slot candidates, which are subsequently clustered into states. Vukovic et al. [43] extend the transfer learning method of [42] by starting from the same slot boundary detection model, but using topological data analysis methods to increase the recall of the candidate slot extraction step. Rather than extracting slots through weak supervision or transfer learning, the method of [44] extracts slots in a completely unsupervised manner, using self-supervised language models trained on the task-specific dialogues and unsupervised parsers to identify slot candidates, after which these are similarly clustered to obtain slot states.

2.4 Applications of dialogue structures

While in our current paper, we solely focus on structure induction as an information extraction task, the inferred dialogue structure may be further used for other applications. In particular, it can be used for (i) accelerating dialogue policy learning [16, 45, 46], (ii) building more controllable and coherent dialogue agents, in open-domain [36, 47] and domain-specific settings [48], (iii) response generation in multi-party dialogues [49], (iv) low-resource dialogue state tracking [37], and (v) zero-shot policy learning that generalizes beyond a single domain [50].

3 Methodology

In Section 3.1, the DSI task is formalized. We specifically focus on recovering dialogue structures from task-oriented dialogues (Section 2.2), in which there are typically two parties who exchange utterances consecutively [5, 24,25,26,27, 51]. We will refer to the two parties in the dialogues as ‘users’ and ‘systems’ respectively, with the ‘system’ utterances generated by, e.g., a support agent in response to requests from a client (‘user’). We describe the cluster-based approach of [18, 19] in Section 3.2, followed by our proposed ellodar strategy to obtain utterance representations in Section 3.3.

3.1 Task formulation

We are given a set \(\mathcal {D}\) containing N dialogues between users and systems. Each dialogue \(d\in \mathcal {D}\) is a sequence of n utterances, alternating between user utterances \(x^{\textsc {u}}\) and system utterances \(x^{\textsc {s}}\) (or vice versa): \(\big [x_1^{\textsc {u}}, x_2^{\textsc {s}}, x_3^{\textsc {u}},\ldots , x_n^{\textsc {s}}\big ]\). Unsupervised dialogue structure induction aims to infer from \(\mathcal {D}\) the conversational graph (V, E) with vertices V and edges E. To this end, those utterances that have a common conversational goal (‘intent’) are mapped onto a common dialogue state \(v\in V\) across the corpus. User utterances \(x^{\textsc {u}}\) are mapped to a user dialogue state \(v\in ~V^{\textsc {u}}\), and system utterances \(x^{\textsc {s}}\) onto a system dialogue state \(v\in V^{\textsc {s}}\), whereby \(V=~V^{\textsc {u}}~\cup V^{\textsc {s}}\) and \(V^{\textsc {u}}~\cap V^{\textsc {s}} = \emptyset \). Assigning utterances to the correct state depends on the conversational context, such that two utterances with the same wording but in different dialogues may refer to different dialogue states. The edges \(e_{ij} \in E\) represent the probability \(p_{i,j}\) of transitioning from state \(v_i\) to \(v_j\) when following the conversation. Given the alternating user and system utterances in a dialogue, it is assumed that state transitions happen from a user to a system state or vice versa: \(\forall (v _{i}, v_{j}) \in ~V^{\textsc {u}} \times V^{\textsc {u}}: p_{v_{i},v_{j}} = 0\) (similar for \(V^{\textsc {s}}\)).

3.2 Cluster-based dialogue structure induction

We consider the cluster-based method of [18, 19], frequently adopted as a baseline for DSI, which encodes utterances as vectors and then clusters them into the \(|V^{\textsc {u}} |\) user and \(|V^{\textsc {s}} |\) system states. The transition probabilities \(p_{i,j}\) between states \(v_i, v_j \in V\) are computed by counting the number of utterances in \(v_i\) for which the following utterance is in \(v_j\), and then normalizing by the total number of utterances in \(v_i\):

$$\begin{aligned} p_{i,j} = \frac{\#(v_i \rightarrow v_j)}{\#v_i} \end{aligned}$$
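As a minimal illustration of this counting step (assuming utterances have already been assigned to states by the clustering step; all names below are illustrative rather than part of the released code), the transition probabilities can be estimated as follows:

```python
from collections import defaultdict

def transition_probabilities(dialogues):
    """Estimate p_{i,j} from state-annotated dialogues.

    `dialogues` is assumed to be a list of dialogues, each given as the sequence
    of state ids assigned to its consecutive utterances.
    """
    counts = defaultdict(lambda: defaultdict(int))  # counts[v_i][v_j] = #(v_i -> v_j)
    totals = defaultdict(int)                       # utterances in v_i that have a successor
    for states in dialogues:
        for v_i, v_j in zip(states, states[1:]):
            counts[v_i][v_j] += 1
            totals[v_i] += 1
    # normalize so that each state's outgoing probabilities sum to one
    return {v_i: {v_j: c / totals[v_i] for v_j, c in succ.items()}
            for v_i, succ in counts.items()}

# e.g., two toy dialogues over alternating user/system states
probs = transition_probabilities([["u0", "s0", "u1", "s0"], ["u0", "s0"]])
# probs["u0"]["s0"] == 1.0
```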

Works that compare against this cluster-based method (i) use sub-optimal embeddings and (ii) do not use dialogue context. In particular, only the current utterance is encoded, as a bag-of-words using GloVe [32, 52], word2vec [16, 23] or BERT [17, 37, 53]. Yet, such methods have been shown to produce sentence embeddings of low quality [21, 22]. Thus, we propose ellodar to efficiently learn locally dialogue-aware representations by using (i) more powerful transformer-based sentence encoders such as SBERT [22] and TOD-BERT [28], and (ii) the local context window (i.e., the preceding and next utterances) around the current utterance.

3.3 Efficiently learning locally dialogue-aware representations

ellodar increases training efficiency by using only the previous and next utterances as context (yet considering both directions), based on the observation that utterances in task-oriented dialogues surrounded by similar context windows often have the same conversational goals. Additionally, ellodar does not train an encoder from scratch, as that would require significant computational effort, whereas we envision a competitive yet computationally efficient method. Rather, ellodar exploits the rich semantics captured in the embeddings produced by pretrained transformer-based sentence encoders.

3.3.1 Model description

ellodar combines two distinct strategies. In each strategy, a linear transformation is learned to transform an utterance x, as first encoded by a frozen pretrained sentence encoder \(\phi (x)\), to a context-aware representation \(f(\phi (x))\). We train different such transformations respectively for user and system representations (\(f^\textsc {u}\) resp. \(f^\textsc {s}\)). The first strategy is designed to learn representations that are similar for utterances that (can) appear in the same context of preceding and following utterances. In practice, we only consider adjacent utterances as the context window, and the linear maps are learned by extrapolating the considered utterance x’s representation \(\phi (x)\) onto those of the adjacent utterances.

More formally, the representation \(f_{\textsc {ext},i}^{*}\in \mathbb {R}^{2h}\) for utterance \(x_i\) is obtained from the pretrained encoder representation \(\phi (x_i)\in \mathbb {R}^{h}\) (with the superscript \(*\in \{\textsc {u}, \textsc {s}\}\) indicating the system or user), as

$$ f_{\textsc {ext},i}^{*}~{\triangleq } f_{\textsc {ext}}^{*}\left( \phi \left( x_i\right) \right) = W^*_{\textsc {ext}}\,\phi (x_i) + b^*_{\textsc {ext}} $$

The parameters \(W^*_{\textsc {ext}} \in \mathbb {R}^{2h \times h}\) and \(b^*_{\textsc {ext}} \in \mathbb {R}^{2h}\) are trained by minimizing a vector similarity loss \(\mathcal {L}^*_{\textsc {ext}, i}\), i.e., ordinary least squares (\({{\,\textrm{OLS}\,}}\)):

$$ \mathcal {L}^*_{\textsc {ext},i} = {{\,\textrm{OLS}\,}}{\big (f_{\textsc {ext},i}^{*},\, \phi (x_{i-1})\oplus \phi (x_{i+1})\big )} $$

with \(\oplus \) denoting concatenation. This is illustrated by the right-hand part of Fig. 2.

Fig. 2  Training strategies of ellodar, where \(x_i\) are dialogue utterances, and Encoder is a pretrained sentence encoder. Left: \(f^*_{\textsc {int}}\), which predicts the embedding for the current utterance from the local context embedding (i.e., the preceding and following utterances in a dialogue). Right: \(f^*_{\textsc {ext}}\), which predicts the context embedding from the current one

The second strategy interpolates the current user (system) embedding from the adjacent system (user) context embeddings, reflecting the assumption that context windows enclosing similar utterances should be represented close to each other in the utterance representation space. The corresponding representation \(f_{\textsc {int},i}^{*}~\in \mathbb {R}^{h}\) for utterance \(x_i\) is constructed from the pretrained encoder representations \(\phi (x_{i-1})\) and \(\phi (x_{i+1})\) of its adjacent utterances as

$$ f_{\textsc {int},i}^{*}~\triangleq f_{\textsc {int}}^{*} \big (\phi (x_{i})\big ) = W^*_{\textsc {int}}\,\big (\phi (x_{i-1})\oplus \phi (x_{i+1})\big ) + b^*_{\textsc {int}} $$

with \(W^*_{\textsc {int}} \in \mathbb {R}^{h \times 2h}\) and \(b_{\textsc {int}}^* \in \mathbb {R}^h\). The corresponding loss is given by:

$$ \mathcal {L}^*_{\textsc {int}, i} = {{\,\textrm{OLS}\,}}{\big (f_{\textsc {int},i}^{*}\ ,\, \phi (x_{i})\big )} $$

A visual summary is given on the left part of Fig. 2. During training, the introduced loss terms are calculated and minimized over all utterances across all dialogues. After training, cellodar clusters the user utterances \(x_i^{\textsc {u}}\) represented as \(f_{\textsc {ext},i}^{\textsc {u}}\), \(f_{\textsc {int},i}^{\textsc {u}}\) or \(f_{\textsc {ext},i}^{\textsc {u}} \oplus f_{\textsc {int},i}^{\textsc {u}}\) (similarly for the system utterances).
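To make the two strategies concrete, the following is a minimal sketch of how ellodar’s user-side transformations could be fitted and clustered, assuming the frozen encoder is loaded via the sentence-transformers library and OLS and k-means come from scikit-learn; `user_triples` (one (previous, current, next) utterance triple per user utterance, gathered over all dialogues) and `n_user_states` are assumed to be prepared beforehand, and all names are illustrative rather than the actual implementation:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # frozen pretrained encoder phi
from sklearn.linear_model import LinearRegression      # ordinary least squares
from sklearn.cluster import KMeans

encoder = SentenceTransformer("all-MiniLM-L6-v2")       # assumption: MiniLM as phi

prev_emb = encoder.encode([p for p, _, _ in user_triples])   # phi(x_{i-1})
curr_emb = encoder.encode([c for _, c, _ in user_triples])   # phi(x_i)
next_emb = encoder.encode([n for _, _, n in user_triples])   # phi(x_{i+1})
context = np.concatenate([prev_emb, next_emb], axis=1)       # phi(x_{i-1}) concat phi(x_{i+1})

# ext: extrapolate the current utterance embedding onto its context window (OLS).
f_ext = LinearRegression().fit(curr_emb, context)
# int: interpolate the current utterance embedding from its context window (OLS).
f_int = LinearRegression().fit(context, curr_emb)

# cellodar: cluster the learned user representations into |V^u| states.
reps = np.concatenate([f_ext.predict(curr_emb),              # f_ext,i
                       f_int.predict(context)], axis=1)      # f_int,i
user_states = KMeans(n_clusters=n_user_states, n_init=10).fit_predict(reps)
```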

3.3.2 Background

ellodar draws inspiration from the CBOW and skip-gram models [23] for learning word vectors, and especially \(f_{\textsc {ext},i}^{*}\) bears similarities to the skip-thought model [20] for learning general purpose sentence embeddings. However, skip-thought employs two separate decoders to generate the preceding and following sentences, which (i) necessitates substantial computational efforts, (ii) produces sub-optimal sentence embeddings [21, 22] and (iii) requires hyperparameter tuning. In contrast, to specifically obtain dialogue-aware representations for clustering, ellodar exploits pretrained sentence encoders (i) by efficiently learning linear transformations entirely in their encoding space (on CPU) with a vector-to-vector optimization objective, and thus (ii) directly optimizes the embeddings to capture the dialogue context necessary for DSI (rather than using an indirect decoding objective), and (iii) does not require hyperparameter tuning.

Other methods, like DialoGPT [54], PLATO [55], and TOD-BERT [28], pretrain encoders on task-oriented dialogues to produce utterance representations that can be used in various downstream tasks, including clustering. It is worth noting that ellodar differs in that it does not (pre)train an encoder from scratch, but rather works complementarily and out-of-the-box with such already pretrained encoders: ellodar adapts their representations to the task-specific dialogues by learning a linear transformation on top of them to specifically improve cluster performance. ellodar’s efficient linear vector-to-vector regression is thus only feasible because the pretrained encoders have already absorbed the bulk of the computational effort.

4 Experimental setup

We describe the datasets, and how they were adapted for DSI, in Section 4.1. In Section 4.2, we motivate our choices of three different types of pretrained sentence encoders that were used to train ellodar, and discuss the recent joint models and cluster baselines to which ellodar is compared in Section 4.3. We provide training details in Section 4.4, and extensively describe the evaluation methodology in Section 4.5.

4.1 Datasets

We follow prior works in unsupervised DSI [16, 17, 37] and conduct experiments on task-oriented dialogues that span 10 domains across four commonly used conversational datasets: DSTC2 [24], CamRest676 [25, 26], SimDial [27] and the Schema Guided Dialogue (SGD) dataset [5]. Our experiments comprise a broader range of datasets compared to prior works: the DVRNN model of [16] was benchmarked on SimDial and CamRest676, the SVRNN model of [17] solely on SimDial, and the model of [37] on SGD, CamRest676, and DSTC2. Our experiments cover all four datasets, thus making ours, to the best of our knowledge, the most comprehensive benchmark to date. SimDial contains synthetic dialogues that were generated using a pre-defined probabilistic grammar. The DSTC2 and SGD datasets consist of human-machine dialogues, whereas the human-human dialogues in CamRest676 were obtained with the Wizard-of-Oz methodology [56].

In the aforementioned datasets, utterances are annotated with intents, acts and slots. We discard slot values and only consider their types, since we map utterances, rather than slots, to dialogue states and because a single type can have potentially many values, which would make the number of dialogue states intractable. Moreover, utterances may have multiple annotations, in which case we combine them. For example, the utterance “I want to find a comedy movie. Search for movies now showing in Oakland” with intent find-movies, act inform, and slot types genre and location, becomes [find-movies, inform.genre, inform.location] (ignoring the respective values comedy and Oakland). Thus, we obtain exactly one label for each utterance, allowing us to compare the induced dialogue states against the gold utterance labels with external cluster metrics, as will be discussed in Section 4.5. The gold numbers of user and system dialogue states, \(|V^{\textsc {u}} |\) and \(|V^{\textsc {s}} |\), are set to the number of unique user and system utterance labels, respectively. The statistics of the various domains and datasets are shown in Table 1 and samples of dialogues are given in Tables 14–16. We release our modified datasets such that they can be adopted in future works.
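As a small sketch of this label construction (the function and field names are hypothetical, not part of the released code):

```python
def utterance_label(intents, act_to_slot_types):
    """Combine intent, act, and slot-type annotations into a single label,
    discarding slot values (e.g., 'comedy' and 'Oakland' in the example above)."""
    parts = list(intents)
    for act, slot_types in act_to_slot_types.items():
        parts.extend(f"{act}.{slot}" for slot in slot_types)
    return "[" + ", ".join(parts) + "]"

label = utterance_label(["find-movies"], {"inform": ["genre", "location"]})
# label == "[find-movies, inform.genre, inform.location]"
```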

4.2 Pretrained sentence encoders

As discussed in Section 3.3.2, rather than pretraining an encoder from scratch, ellodar uses such already pretrained encoders out-of-the-box to produce utterance representations specifically for clustering. Since ellodar is thus agnostic to the sentence encoder, it can in principle be used with any such encoder \(\phi \). For our experiments, we used three different types of models described below.

  • MiniLM-L6 [22]: a general purpose sentence encoder that produces 384-dimensional vectors, offering a good trade-off between encoding speed and quality.

  • TOD-BERT-jnt [28]: a BERT-base model, yielding 768-dimensional embeddings, pretrained with a next-sentence prediction and contrastive objective on 9 task-oriented datasets that include CamRest676 and all domains in SGD. It was pretrained to encode utterances within dialogues, so that these encodings could be used in a variety of task-oriented downstream tasks. Note that while we chose TOD-BERT, other choices of task-oriented encoders such as, e.g., DialoGPT [54] and PLATO [55] are also possible.

  • GloVe [52]: utterances are represented as bag-of-words, i.e., their word-averaged GloVe embeddings. It is used as an ablation for the DVRNN and SVRNN models (see below) whose sentence encoders are initialized with GloVe, and as a baseline for the sentence encoders MiniLM and TOD-BERT.

4.3 Baselines

We aim to induce non-deterministic dialogue structures in task-oriented domains, as mentioned in Section 2.2. This same task is also considered by the joint DVRNN and SVRNN models, hence we use different configurations of these models as baselines for our cellodar approach. In addition, we compare cellodar to the cluster baselines of [18, 19] based on the used sentence encoders without ellodar training. Specifically, the baselines we will benchmark our own approaches against are:

  • DVRNN [16]: a discrete and recurrent extension of the Variational Auto-Encoder that learns to reconstruct the current turn from its discrete latent states and the preceding dialogue context. Turns are clustered into the discrete states.

  • SVRNN [17]: shares the same architecture as DVRNN but extends it with a structured attention mechanism over its hidden states.

  • Cluster baselines: utterances are clustered by using as input features their context window embeddings, represented as the concatenation of the embeddings of the utterances in the window. The utterance embeddings are obtained using the encoders of Section 4.2, and we consider as context windows (i) only the current utterance (indicated as c), as in prior works [16,17,18,19], (ii) the previous and current utterances (pc), and (iii) the full context window of previous, current, and next utterances (pcn).

Table 1 Dataset statistics

In contrast to existing works [16, 17] that compare only with cluster baseline (i), we additionally benchmark against (ii) and (iii), which serve as stronger baselines as they use additional dialogue context, similar to our cellodar approach.
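For clarity, the three context-window variants of the cluster baselines amount to simple feature concatenation before k-means; a sketch, reusing the (assumed) embedding matrices and names from the ellodar sketch in Section 3.3.1:

```python
# Cluster-baseline features (no learned transformation): concatenate the frozen
# encoder embeddings of the utterances in the chosen context window.
features = {
    "c":   curr_emb,                                                # current only, as in [18, 19]
    "pc":  np.concatenate([prev_emb, curr_emb], axis=1),            # previous + current
    "pcn": np.concatenate([prev_emb, curr_emb, next_emb], axis=1),  # previous + current + next
}
baseline_states = {name: KMeans(n_clusters=n_user_states, n_init=10).fit_predict(feats)
                   for name, feats in features.items()}
```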

4.4 Training details

As explained in Section 3.3, we use ordinary least squares to estimate the weights of the linear regression functions of ellodar (to obtain the representations \(f_{\textsc {ext}}^{\textsc {u}}\), \(f_{\textsc {int}}^{\textsc {u}}\), \(f_{\textsc {ext}}^{\textsc {s}}\), and \(f_{\textsc {int}}^{\textsc {s}}\)) and thus do not require hyperparameter tuning. For both cellodar and the cluster baselines, we use k-means to separately cluster the user utterances \(x^{\textsc {u}}\) and the system utterances \(x^{\textsc {s}}\) into respectively the gold number of user and system dialogue states, \(|V^{\textsc {u}} |\) and \(|V^{\textsc {s}} |\). We use 10 random seeds to initialize k-means and report the average scores over these 10 runs for CamRest676 and DSTC2. For SimDial, the results presented in the main body are further averaged over the 4 domains Weather, Bus, Restaurant, and Movies. Similarly, for SGD, we further average over the 4 domains Events, Homes, Music, and Movies. Scores for the individual domains, as well as the mean (± standard deviation) over the domains, are given in Appendix A.

4.5 Evaluation

Shi et al. [16] opted for a qualitative evaluation in which humans rated induced conversational graphs. Qiu et al. [17] presented two automatic metrics to quantitatively assess the quality of such graphs: Structure Euclidean Distance and Structure Cross-Entropy, which both estimate a probabilistic mapping between the induced and the gold states. However, the authors later deemed them unstable because of their high variance and instead recommended employing external cluster metrics for evaluating induced conversational graphs based on slot clusters [42].

In Section 4.1, we described how to obtain labels for utterance-based DSI, enabling us to also adopt such metrics, in particular: (i) the adjusted rand index (ARI) [57], (ii) the adjusted mutual information (AMI) [58] and (iii) the Fowlkes-Mallows score (FM) [59]. ARI and AMI extend respectively the rand index [60] and the mutual information to adjust for chance: random clusters obtain a score of 0.0 whereas perfect ones obtain 1.0. The rand index measures, out of all pairs of samples, the percentage of correct ones. A pair is correct when either (i) both samples have the same gold label and they are assigned to the same cluster, or (ii) both samples have a different gold label and they are mapped to a different cluster. Mutual information, on the other hand, relates to purity and assigns a high score to clusters if the majority of their samples have the same label.
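All three metrics have standard implementations, e.g., in scikit-learn; a minimal sketch, where `gold_labels` and `induced_states` are illustrative names for the per-utterance gold labels of Section 4.1 and the induced cluster identifiers:

```python
from sklearn.metrics import (adjusted_rand_score,
                             adjusted_mutual_info_score,
                             fowlkes_mallows_score)

ari = adjusted_rand_score(gold_labels, induced_states)        # ARI
ami = adjusted_mutual_info_score(gold_labels, induced_states)  # AMI
fm = fowlkes_mallows_score(gold_labels, induced_states)        # FM
```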

DVRNN and SVRNN cluster turns \((x_i^{\textsc {u}}, x_{i+1}^{\textsc {s}})\) of consecutive (user, system) utterances into turn states \(v^{\textsc {turn}}\in V^{\textsc {turn}}~\subseteq V^{\textsc {u}} \times V^{\textsc {s}}\) (or vice versa). The number of turn states \(|V^{\textsc {turn}} |\) corresponds to the number of unique turn labels, i.e., the combinations of labels of the turns’ utterances. Turn clustering becomes challenging when states contain few utterances because the turn states will become even sparser, e.g., for DSTC2 in Table 1: \(|V^{\textsc {turn}} |= 756 \gg |V^{\textsc {u}} |+ |V^{\textsc {s}} |\). To allow for a fair comparison with DVRNN and SVRNN, we report turn state cluster results on SGD and SimDial for cellodar and the cluster baselines. These are automatically inferred by combining the separately induced cluster identifiers of the system and user utterances that comprise a turn. In addition, we report utterance-based results for cellodar and the cluster baselines on CamRest676 and DSTC2. Note that CamRest676 lacks annotations for many system utterances, and the gold turn-based states for DSTC2 become very sparse (\(|V^{\textsc {turn}} |= 756\)). Therefore, obtaining turn-based results for CamRest676 and DSTC2 appeared not feasible, preventing the comparison of our models with DVRNN and SVRNN on these datasets (see Table 3).

5 Results

As the joint models DVRNN and SVRNN are initialized with GloVe embeddings, we first report results for the cluster baselines and cellodar also based on GloVe, thereby eliminating any advantage attributable to the use of pretrained transformers in our most competitive models. Table 2 shows that the cluster baselines outperform the joint models in almost all cases. Only for SimDial does the bag-of-words model of [18, 19] (GloVe\(_{\textsc {c}}\)) perform worse in terms of ARI and FM. Most notably, the strongest baseline (pcn) surpasses SVRNN on SimDial (SGD) by +49.4 (+10) percentage points in ARI, +34.9 (+27.3) in AMI, and +60.8 (+7.6) in FM. Moreover, the best cellodar model consistently outperforms the best cluster baseline, with further improvements on SimDial (SGD) of +23.3 (+3.8) in ARI, +13.1 (+1.2) in AMI, and +19.3 (+4.1) in FM. The key observations from this GloVe-based comparison are: (i) all cluster baselines, except the bag-of-words model of [18, 19] (GloVe\(_{\textsc {c}}\)), outperform the joint models, and (ii) the best cellodar model consistently outperforms the best cluster baseline.

Table 2 Main results with GloVe

Table 3 shows that these observations also hold for the sentence encoders MiniLM and TOD-BERT, with cellodar and the cluster baselines outperforming their counterparts based on the bag-of-words GloVe encoder. Note that the models of [18, 19] (subscripted by c) with MiniLM and TOD-BERT consistently outperform the joint models, which was not always the case for GloVe.

Table 4 reports the training times of cellodar and the joint DVRNN and SVRNN models. First, we discuss the computational resources required to train each model. DVRNN and SVRNN are built on the same code base. We adopt the hyperparameters from [17] with: (i) dropout set to 0.5, (ii) Adam as optimizer, (iii) a learning rate of 0.001, and (iv) 60 epochs. DVRNN is trained on a single GTX 1080 Ti GPU, using 40 dialogues per batch for all SimDial and all SGD domains. SVRNN uses a single Tesla V100 with batch size 40 for all SimDial domains and size 10 for all SGD domains (we could not fit more in memory). In contrast, cellodar uses a single 2.6 GHz Intel Core i7 to first learn its representations with ordinary least squares, and then cluster them with k-means [61], with 1,000 as the maximum iterations, and its k centroids initialized by k-means++ [62]. On a Tesla V100 GPU, MiniLM and TOD-BERT have encoding speeds of respectively 14,200 and 2,800 utterances/second. Encoding the largest considered dataset then takes respectively 1.84 and 9.34 seconds for MiniLM and TOD-BERT. When adding 9.34 seconds for the worst-case encoding speed to the average of 15.2 seconds to both learn and cluster representations, our slowest model, TOD-BERT\(_{\textsc {int+ext}}\), achieves a speedup of 89\(\times \,\) compared to DVRNN and 4,909\(\times \,\) compared to SVRNN, as shown in Table 4. Encoding sentences with MiniLM rather than with TOD-BERT results in further speedups, making it 279\(\times \,\) and 15,894\(\times \,\) faster than DVRNN and SVRNN, respectively.

Table 3 Main results with sentence encoders
Table 4 Training times

6 Discussion

In Section 6.1, we compare the recent joint models to the cluster-based methods, i.e., the cluster baselines and cellodar. Next, we compare the performance of ellodar’s two encoding strategies int and ext in Section 6.2. The effect of including local context by vector concatenation on the cluster baselines’ performance is analyzed in Section 6.3. We discuss the impact of using a bag-of-words, general purpose, or task-oriented sentence encoder on the cluster performance in Section 6.4. Then, in Section 6.5, we vary the gold number of dialogue states used as input for the clustering algorithm, to analyze cellodar’s effectiveness if that gold number of states is unknown. We compare the training time performance of the joint models to that of cellodar in Section 6.6, present ablation studies in Section 6.7 and a qualitative analysis of ellodar’s failure modes in Section 6.8, and conclude in Section 6.9 by discussing the limitations of this work.

6.1 Joint methods versus cluster-based approaches

Tables 2 and 3 show that the joint methods are consistently outperformed by the cluster baselines and cellodar. As also evidenced by the low AMI, ARI, and FM scores, we observed that the joint models frequently clustered utterances with different ground-truth labels into the same state. As the joint models are based on variational auto-encoders optimized with a next-turn decoding objective, we hypothesize that their poor performance is caused by posterior collapse [38,39,40]. The latter occurs when the model relies solely on the decoder’s auto-regressive properties rather than on the latent states to decode the next turn. That is, even if the joint models ignore the latent states entirely, they may still attain a small decoding loss. This explains why utterances with different ground-truth states are often incorrectly assigned to the same state. The cluster baselines and cellodar, on the other hand, do not rely on such decoding objectives, but instead induce dialogue states with k-means and thus directly exploit similarities between vector representations of utterances.

Moreover, the results in Section 5 demonstrate that the best cellodar models consistently outperform the best cluster baselines. Unlike ellodar, the cluster baselines do not learn to incorporate local dialogue context into utterance representations; instead, they simply concatenate representations. This indicates that learning how to include local context into representations is beneficial for DSI, and that ellodar’s learning schemes are successful at doing so. We consider ellodar the main technical contribution of this work.

6.2 Comparing ellodar’s encoding schemes

We note that the int and ext encodings of the utterance take different views: while ext aims to reconstruct a representation of the context from an utterance \(\phi (x)\) itself, int rather aims to reconstruct the utterance representation \(\phi (x)\) from the context. Given this complementary mechanism, we a priori expect their combination (int+ext) to perform best, while superiority of one over the other cannot be intuitively anticipated. The results in Tables 2 and 3 reveal that for SimDial all 3 encodings perform nearly perfectly, which prevents us from distinguishing their performance. Still, for both SGD and CamRest676, int+ext performs notably better than int, and slightly better than ext, thus confirming our a priori expectation. Somewhat surprisingly, on DSTC2, int+ext clearly performs worse than ext. We can, however, attribute this to the fact that DSTC2 comprises human-to-chatbot dialogues where the bot frequently misinterprets the user, thus leading to contexts that are sometimes disconnected from an enclosed utterance: as a result, (erroneous) context information from int is not as useful, as also reflected in the low int scores.

6.3 Impact of local context on the performance of the cluster baselines

We investigate the effect on structure quality of the two straightforward vector concatenation approaches for incorporating preceding (pc), and both preceding and subsequent (pcn) context. This contrasts with the model of [18, 19], which uses no context and was later adopted as a baseline in [16, 17]. Intuitively, we expect the cluster baselines that leverage the full context (pcn) to perform better than those using only the preceding (pc) or no context at all (c). Tables 2 and 3 reveal that on SimDial and SGD, the cluster metrics indeed consistently improve as the context window expands: c<pc<pcn. Conversely, the results for CamRest676 and DSTC2 get worse as the context window grows larger. The CamRest676 results indicate that naively including context does not always improve structure quality, emphasizing the benefits of using more advanced strategies like ext and int+ext. On DSTC2, the difference is even more apparent, with the cluster baselines of c clearly outperforming those of pcn, which is consistent with the previously discussed results of int and ext and thus attributed to the erroneous context from the human-to-chatbot dialogues. Still, we recommend adopting pc and pcn as baselines, since they significantly improve the average structure quality on all 4 SimDial and SGD domains.

6.4 Impact of the sentence encoder on the structure quality

First, as Tables 2 and 3 show, both the cluster baselines and the cellodar models based on bag-of-words representations (GloVe) perform consistently worse than their counterparts based on powerful sentence encoders (MiniLM and TOD-BERT), supporting our claim that transformer-based encoders are better for DSI.

Second, we investigate whether TOD-BERT, specifically trained to encode utterances in dialogues, outperforms the general purpose encoder MiniLM. The results in Table 3 are mixed. Since TOD-BERT was trained on all 16 SGD domains, including the 4 that we consider, we indeed find that on SGD, TOD-BERT models consistently outperform those based on MiniLM, notably for int+ext (improvements of +9.4, +2.4, and +12.4 for the ARI, AMI, and FM metrics respectively). Since TOD-BERT is also trained on all CamRest676 dialogues, it is surprising that MiniLM outperforms it. We hypothesize that this is due to the fact that the CamRest dialogues (i) comprise only 0.67% of the total dialogues used to train TOD-BERT (whereas the SGD dialogues account for 22.66%), and (ii) are dissimilar to those of SGD, such that little transfer occurs. Furthermore, the results on SimDial and DSTC2 (which were not used to train TOD-BERT) vary, with TOD-BERT outperforming MiniLM for some models but not for others, making it difficult to draw conclusions about the transferability to unseen domains.

In summary, the preliminary evidence on SGD suggests that it may be beneficial to pretrain sentence encoders specifically on the dialogues from which the structure is induced. The advantages of transferring to dialogues from unseen datasets (SimDial, DSTC2), however, remain unclear.

6.5 Overestimating the number of dialogue states

We assumed the gold numbers of \(|V^{\textsc {u}} |\) user and \(|V^{\textsc {s}} |\) system states to be known and used them to initialize k-means. In practice, \(|V^{\textsc {u}} |\) and \(|V^{\textsc {s}} |\) can be estimated by inspecting a subset of dialogues, but determining them exactly is challenging. Therefore, we investigate the effect of overestimating the number of states by initializing k-means with twice the gold number of user and system states: \(k=2\cdot |V^{\textsc {u}} |\) and \(k=2\cdot |V^{\textsc {s}} |\).

We present MiniLM results for the best cluster baseline (pcn) and the best cellodar model (int+ext), both with the overestimated number of clusters, and compare them to their counterparts, as well as the DVRNN and SVRNN models with the gold number of states.

First, Table 5 shows that the overestimated cluster baseline and int+ext still outperform DVRNN and SVRNN in all metrics and on all datasets, with notable improvements for MiniLM\(_{\textsc {pcn}}\) (MiniLM\(_{\textsc {int+ext}}\)) in AMI: +14.9 (+16.3) on SimDial, and +27.4 (+29.4) on SGD.

Table 5 Overestimating the number of dialogue states

Second, when comparing the overestimated models to their counterparts initialized with the gold number, we find that the overestimated models (i) drop in ARI and FM, and (ii) drop in AMI, but MiniLM\(_{\textsc {pcn}}\) (MiniLM\(_{\textsc {int+ext}}\)) still attain relatively high values of 76.2 (77.6) on SimDial and 52.7 (54.7) on SGD. Since the number of clusters increased twofold, utterances of the same gold state can be partitioned further into different clusters. Therefore, the decrease in ARI and FM is expected since these metrics penalize utterances of the same gold state if they are mapped to different clusters. AMI, on the other hand, measures cluster purity, with a high score indicating that most utterances in a cluster belong to the same gold state.

Thus, even when the number of clusters is overestimated by a factor of two, the cluster baseline and cellodar induce relatively pure clusters, with the latter outperforming the former, and both still considerably better than the DVRNN and SVRNN with the gold number of states.

6.6 Training time performance

In Section 5, we reported that our slowest cellodar model achieved a speedup of 89\(\times \) over DVRNN and 4,909\(\times \) over SVRNN. This efficiency gap can be attributed to the fact that the joint models are optimized with stochastic gradient descent (SGD), whereas cellodar is trained with more efficient learning schemes. Training neural networks with SGD requires multiple epochs of forward and backward passes through all training samples before converging to a local minimum, and thus, as per [16, 17], we used 60 epochs to train the joint models. Although cellodar relies on neural networks (MiniLM and TOD-BERT) to obtain sentence representations, encoding all training samples requires just a single forward pass. Similarly, ellodar’s linear transformations are cast as vector-to-vector regression and can thus be learned with ordinary least squares in a single pass. As the k-means algorithm of [61] has efficient implementations [63], clustering the ellodar representations is fast.

The training time difference between cellodar based on TOD-BERT and on MiniLM is twofold. With 3M parameters compared to 110M, MiniLM encodes sentences much faster than TOD-BERT. Additionally, MiniLM produces 384-dimensional vectors, while TOD-BERT produces vectors with twice the number of dimensions (i.e., 768). As k-means runtime depends on the number of input features, clustering MiniLM’s representations is thus faster than clustering TOD-BERT’s.

Table 6 Impact of bidirectional context on structure quality

6.7 Ablation study

We provide ablations to assess the impact of ellodar’s different components. First, we examine if training ellodar with bidirectional context, i.e., both preceding and subsequent dialogue context, improves structure quality compared to training ellodar with only preceding or subsequent context. Second, since ellodar uses a local context window (the preceding and subsequent utterances) for efficient representation learning, we explore whether training on larger context windows is useful.

Impact of bidirectional context on structure quality

To assess the impact of bidirectional context on cluster performance, we compare ellodar’s strategies: int, ext, and int+ext, trained with only the preceding (P) or next (N) utterance as context, against ellodar’s standard bidirectional (PN) context. For int, we transform the preceding (respectively next) utterance representation \(\phi (x_{i-1})\) (respectively, \(\phi (x_{i+1})\)) into the representation of the current utterance \(\phi (x_i)\). The training scheme and loss for ‘interpolating’ from the preceding utterance are:

$$ {f_{\textsc {int,p},i}^{*} \triangleq f_{\textsc {int,p}}^{*}\big (\phi (x_{i})\big ) = W^*_{\textsc {int,p}}\,\phi (x_{i-1}) + b^*_{\textsc {int,p}},} $$

and loss \({\mathcal {L}^*_{\textsc {int,p}, i} = {{\,\textrm{OLS}\,}}{\big (f_{\textsc {int,p},i}^{*} , \phi (x_{i})\big )}.}\)

Similarly, for ext with the preceding (next) utterance as context, we extrapolate from \(x_i\) to \(x_{i-1}\) (\(x_{i+1}\)) using:

$$ f_{\textsc {ext,p},i}^{*} \triangleq {f_{\textsc {ext,p}}^{*}\big (\phi (x_{i})\big ) = W^*_{\textsc {ext,p}}\,\phi (x_i) + b^*_{\textsc {ext,p}},} $$

and loss \({\mathcal {L}^*_{\textsc {ext,p},i} = {{\,\textrm{OLS}\,}}{\big (f_{\textsc {ext,p},i}^{*} , \phi (x_{i-1})\big )}.}\)

Note: the representation for int+ext with the preceding utterance as the only context is given by the concatenation \(f_{\textsc {ext,p}}^{*}\big (\phi (x_{i})\big ) \oplus f_{\textsc {int,p}}^{*}\big (\phi (x_{i})\big )\) (and similarly for the next utterance as context).

Except for ext on DSTC2, the results presented in Table 6 clearly underscore the importance of using bidirectional context for learning representations to induce dialogue structures: across all datasets and strategies, using both preceding and subsequent dialogue context (PN) consistently yields higher structure quality compared to using either preceding (P) or subsequent (N) context alone.

Impact of dialogue context width on structure quality

In the previous paragraph, we highlighted the importance of training ellodar with bidirectional context rather than with solely the preceding or subsequent dialogue context. However, it is worth noting that ellodar uses only the local dialogue context, comprising the preceding and subsequent utterances, to efficiently learn representations. Here, we investigate whether using larger (bidirectional) dialogue contexts can yield improved ellodar representations. To explore this, we compare the performance of ellodar’s strategies int, ext, and int+ext trained on larger dialogue contexts against training ellodar with the default local context window. We experiment with two context windows increasingly larger than ellodar’s default local dialogue context window of just one preceding and one subsequent utterance (PN):

  1. The dialogue context consisting of the concatenation of the representations of the 2 preceding and 2 subsequent utterances (shortly written as P\(_2\)N\(_2\)), for which we provide the training scheme and loss below for the int strategy:

    $$ f_{\textsc {int,p}_{2}\textsc {n}_{2},i}^{*} \triangleq f_{\textsc {int,p}_2\textsc {n}_2}^{*}\big (\phi (x_i)\big ) = W^*_{\textsc {int,p}_{2}\textsc {n}_{2}}\,\big (\phi (x_{i-2})\oplus \phi (x_{i-1})\oplus \phi (x_{i+1})\oplus \phi (x_{i+2})\big ) + b^*_{\textsc {int,p}_{2}\textsc {n}_2}, $$

    with as loss \(\mathcal {L}^{*}_{\textsc {int,p}_2\textsc {n}_2, i} = {{\,\textrm{OLS}\,}}{\big (f_{\textsc {int,p}_2\textsc {n}_2,i}^{*}, \phi (x_{i})\big )}\).

  2. The dialogue context consisting of the concatenation of the average of all preceding and the average of all subsequent utterance representations (P\(_{*}\)N\(_{*}\)). Note that we take the mean of all preceding and the mean of all subsequent utterance representations, rather than concatenating them all, to avoid high-dimensional representations that may prevent efficient clustering. We provide the training scheme and loss below for the int strategy:

    $$ f_{\textsc {int,p}_{*}\textsc {n}_{*},i}^{*} \triangleq f_{\textsc {int,p}_{*}\textsc {n}_{*}}^{*} \big (\phi (x_i)\big ) = W_{\textsc {int,p}_{*}\textsc {n}_{*}}^{*}\,\Big (\underset{j=0,\ldots , i-1}{\text {average}}\,\phi (x_j) \,\oplus \underset{k=i+1,\ldots ,N-1}{\text {average}}\,\phi (x_k)\Big ) + b_{\textsc {int,p}_{*}\textsc {n}_{*}}^{*}, $$

    with as loss \(\mathcal {L}^*_{\textsc {int,p}_{*}\textsc {n}_{*},i} = {{\,\textrm{OLS}\,}}{\big (f_{\textsc {int,p}_{*}\textsc {n}_{*},i}^{*}, \phi (x_{i})\big )}\).
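A rough sketch of how the P\(_{*}\)N\(_{*}\) context features could be assembled for one utterance position (assuming a per-dialogue matrix of frozen encoder embeddings; names are illustrative, and boundary utterances would need padding in practice):

```python
import numpy as np

def wide_context_features(dialogue_embs, i):
    """P*N* context for utterance i: mean of all preceding embeddings concatenated
    with the mean of all subsequent embeddings (assumes 0 < i < len(dialogue_embs) - 1)."""
    prev_mean = dialogue_embs[:i].mean(axis=0)       # average of phi(x_0), ..., phi(x_{i-1})
    next_mean = dialogue_embs[i + 1:].mean(axis=0)   # average of phi(x_{i+1}), ..., phi(x_{n-1})
    return np.concatenate([prev_mean, next_mean])
```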
Table 7 Impact of dialogue context width on structure quality

Table 7 reveals that, for SimDial, there is minimal difference in cluster performance among the various dialogue context sizes. However, across all other datasets (excluding DSTC2 and the int(+ext) strategy), the results indicate that the overall best structure quality is achieved when ellodar is trained with the local context window of just a single preceding and next utterance (PN). It consistently outperforms ellodar trained with the full context window P\(_{*}\)N\(_{*}\), and is either better than or on par with ellodar trained on P\(_{2}\)N\(_{2}\) as context. This observation is further supported by the fact that the larger the context window, the poorer the cluster performance: the cluster performance for P\(_{*}\)N\(_{*}\) is inferior to that of P\(_{2}\)N\(_{2}\), with the latter slightly underperforming compared to the local dialogue context window PN. These results confirm that using only the local dialogue context for learning ellodar’s representations is a good choice. However, it is worth noting that the capacity of ellodar’s linear vector-to-vector regression is limited. As a result, ellodar may be too constrained to effectively exploit the subtle signals in larger dialogue contexts. Nevertheless, the observation that ellodar can effectively exploit signals in the local dialogue context alone suggests that there is sufficient signal in this local context to induce dialogue structures. This is particularly noteworthy when compared to more complex variational-based models such as DVRNN and SVRNN, which rely on the entire preceding dialogue context, yet struggle to induce representative dialogue structures.

Table 8 Illustration of common failure modes in ellodar

6.8 Qualitative analysis

While ellodar’s representations can be efficiently learned, its efficiency primarily stems from its linear vector-to-vector regression objective. Yet, linear transformations may be too restrictive to handle complex edge cases, as there is a trade-off between efficiency and the complexity of cases ellodar can model. Hence, to better understand these limitations, we conduct a qualitative analysis of common failure modes of ellodar. We begin by identifying three failure modes, i.e., instances where ellodar’s utterance representations are incorrectly assigned to clusters, and provide examples of each. Next, we present the distribution of these three failure modes by manually categorizing a randomly selected subset of utterances that were erroneously assigned to clusters induced by cellodar into these failure modes.

Identification of common failure modes

To better understand ellodar’s shortcomings, we reveal and analyze common failure modes where utterances are incorrectly assigned to clusters due to ellodar’s learning approach, particularly due to its reliance on local context.

For this, we conduct a qualitative analysis of cellodar-induced clusters based on human annotation. By manually categorizing incorrectly assigned utterances, along with their respective previous and subsequent utterances, we can reveal the most visible failure modes inherent to ellodar’s learning scheme. First, to identify incorrectly assigned utterances, we use the following heuristic: for each cellodar-induced cluster \(\mathcal {C}\), we assign a gold label to \(\mathcal {C}\) which is the most prevalent gold label \(y_{\mathcal {C},\textsc {gold}}\) among all utterances in \(\mathcal {C}\). An utterance \(x_i\) is then erroneously assigned to \(\mathcal {C}\) if its gold label \(y_i\) differs from the most frequently occurring gold label in \(\mathcal {C}\), i.e., \(y_i \ne y_{\mathcal {C},\textsc {gold}}\). Second, to categorize misassigned utterances into failure modes intrinsic to ellodar’s learning scheme, we manually compare each misassigned utterance and its local context window with those of correctly assigned utterances within the same cluster. We consider the following failure modes:

  (1) \(\overline{\text{P}}\text{C}\overline{\text{N}}\): the misassigned utterance \(x_i\) shares semantics with correctly assigned utterances \(x_{j\ne i}\) in cluster \(\mathcal {C}\). However, the preceding utterance \(x_{i-1}\) and subsequent utterance \(x_{i+1}\) differ from the dialogue context of the correctly assigned utterances, i.e., \(x_{i+1} \not \approx x_{j+1}\) and \(x_{i-1} \not \approx x_{j-1}\). The example in the upper row of Table 8 illustrates this, where the misassigned utterance “Yes, that sounds great” is equivalent to the correctly assigned “Sounds great”. Yet, their dialogue states (\(y_j\): affirm for the correctly assigned \(x_j\); \(y_i\): select for the misassigned \(x_i\)) differ due to variations in the semantics of both the preceding and subsequent utterances. This mode is intrinsic to ellodar, where: (i) for \(f_{\textsc {int},i}^{*}\), two distinct context windows transform into the same utterance representation, i.e., \(\phi (\)“Sounds great”) \(\approx \) \(\phi (\)“Yes, that sounds great”), and (ii) for \(f_{\textsc {ext},i}^{*}\), the same input utterance representation \(\phi (\)“Sounds great”) \(\approx \) \(\phi (\)“Yes, that sounds great”) may transform into the context representation most frequently associated with this input (e.g., that of the correctly assigned utterances).

  (2) \(\text{P}\overline{\text{C}}\text{N}\): the misassigned utterance \(x_i\) lacks shared semantics with the correctly assigned utterances \(x_{j \ne i}\) in \(\mathcal {C}\). However, both the preceding and subsequent utterances share semantics among the misassigned and correctly assigned utterances, i.e., \(x_{i+1} \approx x_{j+1}\) and \(x_{i-1} \approx x_{j-1}\). In the middle part of Table 8, \(x_i\) is more specific than \(x_j\), as it not only informs about the number of beds but also mentions allowing pets. Note that the reverse, where \(x_j\) is more specific than \(x_i\), can also occur. This mode is intrinsic to ellodar, where: (i) for \(f_{\textsc {int},i}^{*}\), two equivalent context representations may transform into the utterance representation most frequently surrounded by that context, i.e., \(\phi (x_j)\), and (ii) for \(f_{\textsc {ext},i}^{*}\), semantically different input utterances transform into the same context representation.

  (3) \(\overline{\text{P}}\overline{\text{C}}\text{N}\): the only shared semantics among the correctly assigned utterances \(x_{j\ne i}\) and the misassigned utterance \(x_i\) are those of the subsequent utterances, i.e., \(x_{i+1} \approx x_{j+1}\). As illustrated in the bottom part of Table 8, akin to the example for \(\text{P}\overline{\text{C}}\text{N}\), the semantics of \(x_i\) and \(x_j\) are similar, but \(x_j\) is more specific as it also requests the event name aside from the city. This mode is intrinsic to ellodar for similar reasons as the \(\text{P}\overline{\text{C}}\text{N}\) mode, with the difference that the subsequent utterance, whose semantics are shared among \(x_i\) and \(x_j\), has a larger effect on the final representation than the preceding utterance, which does not share the same semantics.

Note that our list of three failure modes is non-exhaustive: other combinations of shared and non-shared semantics among the preceding utterance, the considered utterance, and the subsequent utterance may also occur. However, as we did not encounter such failure modes in our randomly selected subset of 120 erroneously assigned utterances (as described below), we do not include them here. Aside from the presented failure modes inherent to ellodar, there are also failure modes not related to ellodar but inherent to k-means clustering itself, such as outliers. For the instances where utterances cannot be categorized into one of the three presented failure modes, we include an extra “other” mode.
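The misassignment heuristic described above can be summarized in a few lines. The sketch below is illustrative (variable names are assumptions) and operates on per-utterance induced cluster ids and gold dialogue-state labels.

```python
# Minimal sketch of the misassignment heuristic: an utterance is flagged as
# misassigned when its gold dialogue state differs from the most prevalent
# gold state within its induced cluster.
from collections import Counter

def find_misassigned(cluster_ids, gold_labels):
    """cluster_ids[i]: induced cluster of utterance i; gold_labels[i]: its gold state."""
    per_cluster = {}
    for c, y in zip(cluster_ids, gold_labels):
        per_cluster.setdefault(c, Counter())[y] += 1
    # Majority gold label per induced cluster.
    majority = {c: counts.most_common(1)[0][0] for c, counts in per_cluster.items()}
    return [i for i, (c, y) in enumerate(zip(cluster_ids, gold_labels))
            if y != majority[c]]
```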

The distribution of common failure modes

To better understand the frequency with which each of the three identified failure modes occurs, we randomly sampled and manually annotated 20 (10 user and 10 system) misassigned utterances of cellodar-induced clusters for each SGD domain (Events, Homes, Music, and Movies), CamRest676, and DSTC2, for a total of 120 utterances. Note that SimDial is excluded from this analysis, as cellodar almost perfectly recovers its underlying gold structures.

Table 9 shows the distribution of failure modes in the SGD, CamRest676, and DSTC2 datasets. Overall, \(\text{P}\overline{\text{C}}\text{N}\) is the most frequently occurring failure mode, with the other types of errors occurring less frequently. The results for \(\text{P}\overline{\text{C}}\text{N}\) suggest that ellodar struggles with the edge case in which the surrounding local dialogue context is shared between two utterances that are semantically different (i.e., with different underlying gold dialogue states). ellodar cannot resolve this failure case well due to its sole reliance on local context and linear transformations. Future work could therefore explore trading off some efficiency for additional modeling capacity, e.g., by using non-linear transformations and/or by more effectively exploiting subtler cues in larger dialogue contexts.

Table 9 The distribution of common failure modes

6.9 Limitations

Application domain

First, ellodar is designed specifically for clustering dialogue utterances: it uses the context of both preceding and subsequent utterances to produce contextual representations by learning linear transformations on top of a frozen pretrained sentence encoder. This means it cannot be directly applied to task-oriented downstream tasks that only have access to the preceding dialogue, such as intent classification and response generation.

Second, our work focuses on inducing dialogue structures at the utterance level (assigning utterances to states) and thus cannot be straightforwardly applied to the task of recovering dialogue structures based on slot type induction (assigning words or subphrases to states) as in [42].

Third, our specific focus was on extracting dialogue structures from task-oriented dialogues, which typically involve two parties exchanging utterances in alternation. Therefore, we did not conduct experiments on dialogues with multiple consecutive user or system utterances, nor on multi-party dialogues (where more than two actors can appear in a single dialogue).

Finally, our work focuses on inducing dialogue structures from text only. However, in order to better recover structures, an interesting and unexplored direction for future work would be to consider a multi-modal setting where dialogues are augmented with other modalities, such as images.

Reliance on the ground truth number of dialogue states

The main presented results rely on initializing the number of clusters of all considered models with the ground truth number of dialogue states. In practice, however, the ground truth number of states is unknown and would thus need to be estimated, e.g., by inspecting a subset of the available dialogues. To assess the impact of not setting the correct number of states, Section 6.5 analyzes the effect of overestimating the ground truth number of states by a factor of two, demonstrating that our proposed methods induce relatively pure clusters and still outperform both joint methods. An interesting direction for future work would thus be to investigate clustering algorithms that do not require the number of dialogue states as input, such as DBSCAN [64], mean shift [65], and affinity propagation [66].
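As a pointer towards this direction, the sketch below applies such algorithms from scikit-learn to pre-computed utterance representations; the placeholder data and hyperparameter values are assumptions and would require tuning in practice.

```python
# Minimal sketch: clustering algorithms that infer the number of clusters
# (dialogue states) from the data instead of requiring it as input.
import numpy as np
from sklearn.cluster import DBSCAN, MeanShift, AffinityPropagation

# Placeholder for real utterance representations (e.g., ellodar vectors).
utterance_vectors = np.random.randn(200, 384)

labels_dbscan = DBSCAN(eps=0.5, min_samples=5).fit_predict(utterance_vectors)
labels_meanshift = MeanShift().fit_predict(utterance_vectors)
labels_affinity = AffinityPropagation(random_state=0).fit_predict(utterance_vectors)

# Number of induced states per algorithm (DBSCAN marks noise points as -1).
for name, labels in [("DBSCAN", labels_dbscan),
                     ("MeanShift", labels_meanshift),
                     ("AffinityPropagation", labels_affinity)]:
    n_states = len(set(labels)) - (1 if -1 in labels else 0)
    print(name, n_states)
```

The trade-off is that these algorithms introduce their own hyperparameters (e.g., DBSCAN’s eps), whose tuning would replace the search over the number of states.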

Dialogue context representation strategy

To include local context, we clustered the concatenation of the considered utterance’s representation and its adjacent utterances’ representations, rather than leveraging more advanced techniques that integrate different views of the data, such as multi-view k-means [67]. We leave the latter for future work.
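For concreteness, the following minimal sketch (with placeholder data; names and the zero-padding at dialogue boundaries are assumptions) illustrates this concatenation strategy.

```python
# Minimal sketch of the concatenation strategy: each utterance is represented by
# [previous; current; next] embeddings before k-means clustering.
import numpy as np
from sklearn.cluster import KMeans

def concat_local_context(dialogue_embs):
    """dialogue_embs: (num_utterances, dim) embeddings of a single dialogue."""
    pad = np.zeros_like(dialogue_embs[:1])           # zero vector at the boundaries
    padded = np.vstack([pad, dialogue_embs, pad])
    prev_, curr, next_ = padded[:-2], padded[1:-1], padded[2:]
    return np.concatenate([prev_, curr, next_], axis=1)

# Placeholder for per-dialogue matrices of frozen-encoder (or ellodar) vectors.
all_dialogues = [np.random.randn(8, 384) for _ in range(3)]
num_states = 5

features = np.vstack([concat_local_context(d) for d in all_dialogues])
state_ids = KMeans(n_clusters=num_states, n_init=10).fit_predict(features)
```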

Training time performance analysis

The training time analysis in Section 6.6 compared the training times of state-of-the-art joint models to those of our approaches. Because training time is affected by factors such as implementation and batch size, the reported times should be interpreted as indicative rather than as exact numbers.

Pre-training sentence encoders

Because our work focuses on computational efficiency, we did not further experiment with specifically pretraining sentence encoders on each distinct domain or dataset. However, the preliminary results of TOD-BERT on SGD, discussed in Section 6.4, suggest that such specific pretraining might be beneficial. It is further worth noting that the effectiveness and efficiency of ellodar’s linear vector-to-vector regression is in part attributable to ellodar building upon an out-of-the-box pretrained sentence encoder that has already undergone substantial computational effort: training such encoders from scratch is computationally expensive.

Generalizability to additional human-human dialogues

While our experiments cover a broader range of datasets than prior DSI works, i.e., four commonly used conversational datasets (DSTC2, CamRest676, SimDial, and SGD), it is worth noting that SimDial comprises synthetic dialogues, SGD and DSTC2 contain human-machine dialogues, and CamRest676 contains human-human dialogues. As such, there remains uncertainty about the generalizability of cellodar to human-human dialogues other than CamRest676. Unfortunately, due to the lack of utterance-level annotated conversational datasets (as opposed to slot extraction datasets, e.g., MultiWOZ [51]), we were unable to cover additional datasets, and defer exploring this to future work.

7 Implications of the presented research results

The findings in this work have implications for the relatively underexplored DSI domain. Our main goal was to design an efficient DSI model, which we argued to be essential in practical settings, e.g., when users need to run the DSI model multiple times with different numbers of dialogue states to recover the optimal structure. By revisiting and further developing the cluster-based method of [18, 19], we demonstrated that simple DSI models can be orders of magnitude faster than, yet still outperform, more complex existing models. We therefore want to emphasize that pragmatic architectural choices, rather than the prevailing trend of seeking performance gains through ever more complex (neural) models, can yield improvements in both efficiency and effectiveness. We hope that this will encourage the community to pursue model efficiency as an important design aspect, besides model effectiveness.

Second, as no publicly available framework for benchmarking DSI models currently exists, we release our modified datasets and evaluation setup to accelerate future DSI research, and we hope that our simple cellodar approach will serve as a strong baseline.

8 Conclusions

Unlike recently proposed DSI models that jointly learn to encode and cluster utterances, we revisited an efficient cluster-based approach that proceeds in two steps. It first encodes utterances as vectors, after which it clusters the obtained representations to induce the dialogue structure in the second step. However, the previously proposed cluster-based approach encodes utterances as bag-of-words or skip-thought vectors without using dialogue context. Hence, we proposed to adopt more powerful transformer-based sentence encoders and contributed ellodar, a highly efficient approach for learning dialogue aware representations. ellodar trains linear transformations with a vector-to-vector regression objective in the encoding space of a frozen sentence encoder using a local context window. Extensive experiments on representative DSI datasets show that: (i) the cluster-based approach outperforms the recent joint models when using transformer-based encoders to represent utterances, (ii) clustering ellodar’s representations further improves performance consistently, while being orders of magnitude faster than the joint models. We release our datasets (which are variants of commonly adopted DSI datasets), evaluation, and models as a common benchmark for DSI, which is currently missing.