Introduction

Interactive conversational intelligent assistants are developing rapidly, and people increasingly expect to obtain high-quality recommendations by conversing with them in natural language; conversational recommender systems (CRS) [1, 2] are dedicated to this goal. With the development of neural networks, CRS has been widely applied in e-commerce, news, and other fields [3,4,5,6,7,8], demonstrating its value and drawing many researchers' interest.

Unlike traditional one-shot recommender systems [9, 10], a CRS interactively acquires user preferences over a few conversation rounds to help users query information or accomplish specific tasks. A CRS generally consists of two modules, a recommendation module and a conversation module [11, 12]: the conversation module handles the natural language conversation with users, while the recommendation module dynamically captures user preferences and suggests suitable items (here, movies). An example of CRS-user interaction is illustrated in Fig. 1. The process ends when the user accepts a recommendation or exits the system. This interactive approach makes CRS more flexible and personalized, better meeting user needs.

Fig. 1 An example of a movie recommendation conversation between user and system. Items and tags are marked in pink and yellow, respectively

In CRS, accurately capturing user preferences for recommendation is challenging: preferences must be inferred from relatively short multi-turn dialogues, and the limited information those dialogues contain makes the task harder. Because only a few items appear in the conversation history, item mentions are sparse; existing research mitigates this problem by introducing external knowledge (e.g., knowledge graphs, reviews, introductions) [1, 12,13,14], but difficulties remain.

In existing CRS, the knowledge graph used is a subgraph of DBpedia, which omits some information. We therefore introduce tags and construct a new movie knowledge graph that contains them. Tags are descriptions of the characteristics of items, generally condensed into short phrases; they represent subjective evaluations shared by many people and can play a crucial role in providing recommendations to users.

However, due to data heterogeneity, directly using external information from multiple sources to enhance CRS is difficult. External information differs in content (a tag describes characteristics, whereas a director is a person), so there is a natural information gap between sources, and we must devise a method to bridge it. Besides, tags vary in popularity among the general public (i.e., in how many items a tag appears), making it difficult for the system to distinguish popular from less popular tags, even though popular tags are more likely to align with user preferences. Hence, we also need a method that accounts for the effect of popularity on user preferences.

Furthermore, the responses generated by the system lack descriptiveness and diversity when making recommendations. As shown in Fig. 1, the user requests recommendations similar to "Interstellar", which carries tags such as "science fiction drama films" and "space adventure films"; the system's recommended items, "The Martian" and "Passengers", both match these tags. The recommendation thus better aligns with the user's preferences, and tag information can likewise improve the descriptiveness and diversity of the system's responses.

Therefore, we propose a new model, Multi-source Information Contrastive Learning Collaborative Augmented Conversational Recommender Systems (\(\textbf{MCCA}\)). Its core idea is to fully utilize multi-source content to improve the overall performance of the CRS. We first construct a new Movie Tag Knowledge Graph (\(\textbf{MTKG}\)), which contains association information between movies and their corresponding tags, and combine it with DBpedia. To bridge the information gaps between items in the two knowledge graphs, we devise a Multi-source Item Fusion mechanism (\(\textbf{MIF}\)). To fully mine the connections between the knowledge graphs and optimize the item features fused by MIF, we perform unsupervised contrastive learning between the fused item features and the item features from each knowledge graph. Furthermore, we devise a Multi-Tag Fusion mechanism (\(\textbf{MTF}\)) to address differing tag popularity: we extract keywords from the reviews of the corresponding items with the unsupervised keyword extraction algorithm YAKE [15] and use them as auxiliary information to reinforce tag popularity. In this way, we not only supplement item information but also strengthen the expression of user preferences, and using the fused items and tags in the conversation module also makes the generated responses richer. Extensive experiments on benchmark datasets indicate that MCCA outperforms state-of-the-art CRS models in both recommendation accuracy and response quality.

The contributions of this paper can be summarized as follows:

  (1) We construct a tag-based knowledge graph to introduce tag information for user preference representation.

  (2) We devise a Multi-source Item Fusion mechanism to bridge the information gaps between items from different sources and use unsupervised contrastive learning to optimize the fused item features.

  (3) We devise a Multi-tag Fusion mechanism to address the effect of tag popularity differences on user preferences.

  (4) We use tag information in the conversation module to improve the descriptiveness and diversity of dialogues.

Related work

In recent years, with the rapid development of conversational recommender systems [12, 14, 16], conducting interactive conversations with users and dynamically obtaining their intentions and preferences has become a hot topic. CRS aims to converse with users in natural language and provide high-quality recommendations.

CRS

Current research in CRS can be categorized into two types: attribute-based and open-ended. Our research is based on open-ended CRS.

Attribute-based CRS focuses on asking users preference questions about item attributes to make recommendations [17,18,19]. This type of CRS relies on pre-defined rules (e.g., slot filling [20]) to interact with the user, and it focuses on completing recommendations in as few dialogue rounds as possible. Although such systems are easy to implement, they do not emphasize generating human-like natural language responses and therefore have a poor user experience.

Another type is open-ended CRS, which learns user preferences from the raw dialogue and then generates human-like responses that incorporate the recommended content [21,22,23]. Due to data sparsity, most existing methods help the CRS understand dialogues and capture user preferences by incorporating external knowledge, including entity-level knowledge graphs [1, 17, 24, 25], word-level knowledge graphs [13], reviews [14], and introductions [12]. To exploit such knowledge, researchers have proposed aligning the semantic spaces of two knowledge graphs via mutual information maximization [13] and, for external knowledge with different structures, a coarse-to-fine contrastive learning framework [16] that further improves semantic fusion. Although these methods have improved CRS performance to some extent, utilizing external knowledge effectively remains a challenge.

Contrastive learning

Contrastive learning has been widely used in computer vision [26] and information retrieval [27], demonstrating good results in both fields. It usually relies on data augmentation strategies such as image rotation and random cropping to generate relevant positive samples, with negative samples drawn from the dataset by random sampling. In natural language processing, contrastive learning can be used to better align semantic spaces and ensure their consistency [28]; it can also fuse varied information, such as knowledge graphs and text [16] or text and images [29,30,31], enabling different sources to complement each other and fully exploiting the potential of the data.

Fig. 2 Overview of the MCCA model. Bolded fonts in the context represent items. The yellow parts represent tag information or operations on tags, the green parts represent item information or operations on it in MTKG, and the blue parts represent item information or operations on it in DBpedia. MIF stands for Multi-source Item Fusion mechanism, and MTF stands for Multi-Tag Fusion mechanism

Problem formulation

Open-ended CRS consists of a recommendation module and a conversation module. Formally, we denote a user by \( u \in U \). For the dialogue, we use \( H = \{ {s_k}\}_{k=1}^n \) to denote the list of utterances exchanged between the user and the system, where \({s_k}\) is the utterance in the k-th turn, produced by either the user or the system. \(\mathcal{I}\) denotes the whole set of items. For the knowledge graph DBpedia (\(\mathcal{D}\)), each knowledge fact is a triple \(\left\langle {{e_1},{r_d},{e_2}} \right\rangle \), where \({e_1},{e_2} \in \mathcal{E}\) denote entities and \({r_d} \in \mathcal{R}{_\mathcal{D}}\) denotes the relationship between them. Given the dialogue history H, the goal of the recommendation module is to accurately capture the user's preference and generate a subset of candidate items \({\mathcal{I}_{k + 1}} \subseteq \mathcal{I}\) that satisfies this preference; \({\mathcal{I}_{k + 1}}\) may sometimes be \(\varnothing \). The purpose of the conversation module is to generate the utterance \({s_{k + 1}}\) as a response, which may contain recommended items; if \({\mathcal{I}_{k + 1}}\) is \(\varnothing \), \({s_{k + 1}}\) is a chit-chat utterance. The whole process ends when the user accepts a recommendation or exits the system.
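To make the formulation concrete, the following is a minimal sketch of the interaction loop (names and types are our own, not from the paper): the recommendation module maps the history H to a candidate set \({\mathcal{I}_{k+1}}\), and the conversation module produces \({s_{k+1}}\), falling back to chit-chat when the candidate set is empty.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Utterance:
    """One turn s_k of the dialogue history H, from the user or the system."""
    speaker: str                 # "user" or "system"
    text: str
    items: List[str] = field(default_factory=list)   # items mentioned in this turn

def next_turn(history: List[Utterance],
              recommend: Callable[[List[Utterance]], List[str]],
              generate: Callable[[List[Utterance], List[str]], str]) -> Utterance:
    """One step of the CRS loop: recommend, then respond."""
    candidates = recommend(history)                   # I_{k+1}, possibly empty
    text = generate(history, candidates)              # chit-chat if candidates == []
    return Utterance(speaker="system", text=text, items=candidates)
```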

Methodology

In this section, we present the MCCA method, an overview of which is shown in Fig. 2. It has four parts: multi-source information data, multi-source information fusion, the conversation module, and the recommendation module. In the multi-source information data part, we first utilize the newly constructed movie knowledge graph MTKG (Sect. "MTKG composition") to augment the representation of user preferences over tags as well as the representation of items. Then, we retrieve the tag and review information for all items mentioned in the dialogue from the movie information database (Sect. "Multi-source information acquisition"). Next, we design a multi-source information fusion module (Sect. "Multi-source information fusion augmented by unsupervised contrastive learning"), which includes a fusion mechanism for multi-source item representations (Sects. "Multi-source item fusion mechanism", "Item-level unsupervised contrastive learning") and a multi-tag fusion mechanism (Sect. "Multi-tag fusion mechanism") to enhance user representations.

Multi-source information data

MTKG composition

In past research, the recommendation module utilized only the items mentioned in a dialogue to represent user preferences. However, since only a few items appear in a dialogue, the captured preferences can be inaccurate. In our approach, we introduce tags as external information: tags concisely summarize the main features of an item in short phrases, representing the characteristics of the mentioned items more accurately and enabling a more precise capture of user preferences.

We adopt the existing knowledge graph DBpedia, which does not store the relationships between items and their tags. We therefore construct a new movie knowledge graph containing tag information.

Based on movie tag information in Wikipedia, we collected tags for all items appearing in the dataset. Each movie corresponds to several tags, each a phrase of a few words that accurately describes a characteristic of the movie; the more frequently a tag occurs across all tag information, the more popular it is among the public. After obtaining the tag information, we deleted tags without special meaning (e.g., "shot in 1988") and tags that appeared only once. All the items and their remaining tags are saved in the database \({M_{db}}\), where we also store the collected tags' popularity information and the items' reviews.
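A minimal sketch of this preprocessing step, under our own assumptions about the raw data layout (a mapping from movie titles to raw tag lists and a hand-curated set of meaningless tags):

```python
from collections import Counter
from typing import Dict, List, Set

def build_tag_database(movie_tags: Dict[str, List[str]],
                       meaningless: Set[str]) -> Dict[str, dict]:
    """Drop hand-listed meaningless tags (e.g., "shot in 1988") and tags that
    occur only once, then store each movie's surviving tags together with
    each tag's corpus-wide frequency as its popularity."""
    counts = Counter(t for tags in movie_tags.values() for t in tags)
    keep = {t for t, c in counts.items() if c > 1 and t not in meaningless}
    return {
        movie: {
            "tags": [t for t in tags if t in keep],
            "popularity": {t: counts[t] for t in tags if t in keep},
        }
        for movie, tags in movie_tags.items()
    }
```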

After processing the tag information, we built it into a new Movie-Tag Knowledge Graph, MTKG (\(\mathcal{G}\)), in which the relations between item entity nodes are tags.

Multi-source information-aware item representation

We use the knowledge graph DBpedia to obtain the basic feature representation of each item. Following existing approaches, we use R-GCN [14, 32] to encode each entity e in DBpedia (\(\mathcal{D}\)); formally, its representation is computed by Eq. (1).

$$\begin{aligned} \begin{aligned} d_e^{\ell +1}=\sigma \left( \sum _{r \in \mathcal {R}_\mathcal{D}} \sum _{e^{\prime } \in \mathcal {E}_e^r} \frac{1}{Z_{e, r}} W_{d, r}^{\ell } h_{e^{\prime }}^{\ell }+W_d^{\ell } h_e^{\ell }\right) \end{aligned} \end{aligned}$$
(1)

where \(h_e^\ell \in \mathbb R{^d}\) is the representation of entity e at the \(\ell \)-th layer, d denotes the embedding dimension, \(\sigma \left( \cdot \right) \) denotes the ReLU activation function, \(\mathcal{E}_e^r\) denotes the set of neighboring entities \(e^{\prime }\) of e under relation r, \(W_{d,r}^\ell ,W_d^\ell \) are learnable matrices, and \({Z_{e,r}}\) is a normalization factor. Through Eq. (1), each node aggregates information from its neighboring entity nodes via message passing, and we use the output \({D_\mathcal{D}} = \left\{ {{d_{e,1}},{d_{e,2}}, \cdots ,{d_{e,m}}} \right\} \) as the foundational representation of items, where m denotes the number of items.

To obtain embeddings of each item and tag in the knowledge graph MTKG(\(\mathcal{G}\)), we use another R-GCN to encode them.

$$\begin{aligned} \begin{aligned} g_i^{\ell +1}=\sigma \left( \sum _{r \in \mathcal {R}_\mathcal{G}} \sum _{j \in \mathcal {G}_i^r} \frac{1}{Z_{i, r}} W_{g, r}^{\ell } h_j^{\ell }+W_g^{\ell } h_i^{\ell }\right) \end{aligned} \end{aligned}$$
(2)

where \(g_i^\ell \in \mathbb R{^d}\) is the node's representation at the \(\ell \)-th layer, d is the embedding dimension, \(\sigma \left( \cdot \right) \) is the ReLU activation function, \(\mathcal{G}_i^r\) is the set of neighboring nodes of node i under relation r, \(W_{g,r}^\ell ,W_g^\ell \) are learnable matrices, and \({Z_{i,r}}\) is a normalization factor. Similarly, through Eq. (2) we obtain the embedding representations in MTKG and take the output \({D_\mathcal{G}} = \left\{ {{g_{i,1}},{g_{i,2}}, \cdots ,{g_{i,n}}} \right\} \) as the base representation, where n is the number of nodes.
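The two encoders can be sketched with an off-the-shelf R-GCN layer; here we use torch_geometric's RGCNConv (the paper does not name a specific implementation, and the dimensions are illustrative):

```python
import torch
from torch_geometric.nn import RGCNConv

class KGEncoder(torch.nn.Module):
    """One R-GCN layer over a knowledge graph, matching Eqs. (1)-(2);
    instantiated once for DBpedia and once for MTKG."""
    def __init__(self, num_nodes: int, num_relations: int, dim: int = 128):
        super().__init__()
        self.node_emb = torch.nn.Embedding(num_nodes, dim)    # h^0
        self.conv = RGCNConv(dim, dim, num_relations)         # W_r per relation + self-loop W

    def forward(self, edge_index: torch.Tensor, edge_type: torch.Tensor) -> torch.Tensor:
        h = self.node_emb.weight
        return torch.relu(self.conv(h, edge_index, edge_type))  # sigma = ReLU

# encoder_d = KGEncoder(num_dbpedia_nodes, num_dbpedia_rels)   # output D_D
# encoder_g = KGEncoder(num_mtkg_nodes, num_mtkg_rels)         # output D_G
```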

Multi-source information acquisition

First, we extract the items \({E_h}\) from the dialogue history H using a character matching model, as shown in Eq. (3).

$$\begin{aligned} \begin{aligned} {E_h} = Extract\left( H \right) \end{aligned} \end{aligned}$$
(3)

where \(Extract\left( \cdot \right) \) denotes the extraction operation. Then, all the tags, their popularity, and the reviews corresponding to the items appearing in the dialogue history H are retrieved from the movie information database \({M_{db}}\), as shown in Eq. (4).

$$\begin{aligned} \begin{aligned} {T_{ment}},{T_{pop}},R = Retrieve\left( {{E_h},{M_{db}}} \right) \end{aligned} \end{aligned}$$
(4)

where \(Retrieve\left( \cdot \right) \) denotes the retrieval operation, \({T_{ment}}\) denotes a set of retrieved tags, \({T_{pop}}\) denotes a set of retrieved tags’ popularity, and R denotes a set of retrieved reviews.

After obtaining the information above, we query the embedding dictionary \({D_\mathcal{D}}\) of DBpedia obtained in Sect. "Multi-source information-aware item representation" for the items' embedding set \({E_d}\), and the embedding dictionary \({D_\mathcal{G}}\) of MTKG for the items' embedding set \({E_g}\) and the tags' embedding set \({E_t}\). The item embedding vectors are then stacked into the item feature embedding matrices \({E_\mathcal{D}} \!\in \!\mathbb R{^{a \times d}}\) and \({E_\mathcal{G}} \!\in \! \mathbb R{^{a \times d}}\); similarly, the tag embedding vectors are stacked into the tag feature embedding matrix \({E_\mathcal{T}} \!\in \! \mathbb R{^{b \times d}}\), where a and b denote the numbers of items and tags, respectively.
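A sketch of this retrieval-and-stacking step, assuming the database layout from our earlier sketch and index maps from item/tag names to R-GCN output rows (both our own assumptions):

```python
import torch
from typing import Dict, List

def acquire(e_h: List[str], m_db: Dict[str, dict],
            D_D: torch.Tensor, D_G: torch.Tensor,
            item_idx: Dict[str, int], tag_idx: Dict[str, int]):
    """Eqs. (3)-(4) plus embedding lookup: retrieve tags, popularity, and
    reviews for the mentioned items, then stack embeddings into
    E_D, E_G (a x d) and E_T (b x d)."""
    t_ment, t_pop, reviews = [], {}, []
    for item in e_h:
        entry = m_db[item]
        t_ment.extend(entry["tags"])
        t_pop.update(entry["popularity"])
        reviews.extend(entry.get("reviews", []))
    E_D = torch.stack([D_D[item_idx[i]] for i in e_h])    # item embeddings from DBpedia
    E_G = torch.stack([D_G[item_idx[i]] for i in e_h])    # item embeddings from MTKG
    E_T = torch.stack([D_G[tag_idx[t]] for t in t_ment])  # tag embeddings from the MTKG encoder
    return E_D, E_G, E_T, t_pop, reviews
```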

Multi-source information fusion augmented by unsupervised contrastive learning

Multi-source item fusion mechanism

In Sect. "Multi-source information acquisition", we obtained the item embedding matrices \({E_\mathcal{D}}\) and \({E_\mathcal{G}}\). Embeddings of the same item drawn from different knowledge graphs carry the same identity (e.g., "Inception" in DBpedia vs. "Inception" in MTKG), but simply concatenating the embeddings from the two sources does not fully exploit the potential of the data.

To bridge the information gaps between items from different knowledge graphs, we design a Multi-source Item Fusion mechanism (MIF) that aligns the two highly correlated item embeddings and fuses them. Its core idea is to use the similarity between the two coupled variables to close the information gap, as computed in Eq. (5).

$$\begin{aligned} \begin{aligned} sim\left( {{E_{{\mathcal{D}_i}}},{E_{{\mathcal{G}_j}}}} \right) = \sigma \left( {{W_2} \cdot \sigma \left( {{W_1} \cdot \left[ {{E_{{\mathcal{D}_i}}};{E_{{\mathcal{G}_j}}}} \right] + {b_1}} \right) + {b_2}} \right) \end{aligned} \end{aligned}$$
(5)

where \(sim\left( \cdot \right) \) is the similarity score between two embeddings of an item from different sources with consistent information, and \({E_{{\mathcal{D}_i}}},{E_{{\mathcal{G}_j}}}\) are the two embeddings of that item from the two knowledge graphs. \({W_1}\) and \({W_2}\) are learnable parameter matrices, \(\left[ ; \right] \) denotes the concatenation operation, \({b_1},{b_2}\) are bias terms, and \(\sigma \left( \cdot \right) \) denotes the ReLU activation function. After obtaining the similarity, we fuse the item embeddings to obtain the final item embedding \({E_{fusion}}\):

$$\begin{aligned} \begin{aligned} \begin{array}{l} {s_{i,j}} = \frac{{exp\left( {sim\left( {{E_{{\mathcal{D}_i}}},{E_{{\mathcal{G}_j}}}} \right) } \right) }}{{\sum \nolimits _{k = 1}^n {exp} \left( {sim\left( {{E_{{\mathcal{D}_i}}},{E_{{\mathcal{G}_k}}}} \right) } \right) }}\\ {E_{fusion}} = \sum \limits _{i = 1}^n {\sum \limits _{j = 1}^n {{s_{i,j}}} \cdot {E_{{\mathcal{D}_i}}}} + \sum \limits _{i = 1}^n {\sum \limits _{j = 1}^n {{s_{i,j}}} \cdot {E_{{\mathcal{G}_j}}}} \end{array} \end{aligned} \end{aligned}$$
(6)

We apply softmax to obtain the attention weights \({s_{i,j}}\); each item feature embedding \({E_{{\mathcal{D}_i}}}\) and \({E_{{\mathcal{G}_j}}}\) is then weighted by these attention weights and fused to obtain the fused item feature embedding \({E_{fusion}}\).
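A sketch of MIF under one reading of Eqs. (5)-(6), in which the softmax runs over MTKG items for each DBpedia item and the fusion is applied per item:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MIF(nn.Module):
    """Multi-source Item Fusion: a two-layer MLP scores every (E_D_i, E_G_j)
    pair (Eq. 5), softmax over j gives attention weights s_ij, and each item's
    fused embedding combines its DBpedia view with the attended MTKG views."""
    def __init__(self, dim: int):
        super().__init__()
        self.w1 = nn.Linear(2 * dim, dim)   # W1, b1
        self.w2 = nn.Linear(dim, 1)         # W2, b2

    def forward(self, E_D: torch.Tensor, E_G: torch.Tensor) -> torch.Tensor:
        n, d = E_D.shape
        pairs = torch.cat([E_D.unsqueeze(1).expand(n, n, d),
                           E_G.unsqueeze(0).expand(n, n, d)], dim=-1)
        sim = torch.relu(self.w2(torch.relu(self.w1(pairs)))).squeeze(-1)  # Eq. (5), sigma = ReLU
        s = F.softmax(sim, dim=-1)          # attention weights s_ij, rows sum to 1
        return E_D + s @ E_G                # per-item fusion, one reading of Eq. (6)
```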

Item-level unsupervised contrastive learning

The MIF module yields a better fused item feature embedding, but the connection between the fused item features and the item features from each information source has not been fully mined. We therefore repeatedly perform mutual information maximization between the input item features of each source and the fused item features to optimize the fusion network. From the sections above, we have the item feature representations \({E_\mathcal{D}},{E_\mathcal{G}}\) from the different sources and the fused representation \({E_{fusion}}\). Following Vinyals et al. [33], we measure the connection between \({E_{fusion}}\) and each source \({E_x},x \in \left\{ {\mathcal{D},\mathcal{G}} \right\} \) with a score function \(Score\left( \cdot \right) \) over normalized predictions and true vectors.

$$\begin{aligned} \begin{aligned} \begin{array}{l} Score({E_x},{E_{fusion}}) = exp({\overline{E} _x}{({{\overline{G}}_\varphi }({E_{fusion}}))^T})\\ {{\overline{G}}_\varphi }({E_{fusion}}) = \frac{{{G_\varphi }({E_{fusion}})}}{{{{\left\| {{G_\varphi }({E_{fusion}})} \right\| }_2}}},{\overline{E} _x} = \frac{{{E_x}}}{{{{\left\| {{E_x}} \right\| }_2}}} \end{array} \end{aligned} \end{aligned}$$
(7)

where \({G_\varphi }\) is a neural network with parameters \(\varphi \) that predicts \({E_x}\) from \({E_{fusion}}\), and \({\left\| \cdot \right\| _2}\) is the Euclidean norm. We treat the features of the same source from other items in the batch as negative samples, yielding the loss between each source's item features and the fused item features in Eq. (8).

$$\begin{aligned} \begin{aligned} \mathcal {L}\left( E_{fusion }, E_x\right) =-\mathbb {E}_s\left[ \log \frac{{\text {Score}}\left( E_{fusion}, e_x^i\right) }{\sum _{e_x^j \in E_x} {\text {Score}}\left( E_{fusion }, e_x^j\right) }\right] \end{aligned} \end{aligned}$$
(8)

The final loss for item-level unsupervised contrastive learning is obtained in Eq. (9).

$$\begin{aligned} \begin{aligned} {\mathcal{L}_{CE}} = \mathcal{L}({E_{fusion}},{E_\mathcal{D}}) + \mathcal{L}({E_{fusion}},{E_\mathcal{G}}) \end{aligned} \end{aligned}$$
(9)

Through item-level unsupervised contrastive learning, we optimize the fusion parameters and obtain fused item features \({E_{fusion}}\) that better integrate the multi-source item features.
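A sketch of this objective as an InfoNCE-style loss with in-batch negatives; \(G_\varphi \) is taken to be a single linear layer here, which the paper does not specify:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ItemContrastiveLoss(nn.Module):
    """Eqs. (7)-(9): G_phi predicts each source embedding from E_fusion;
    matching rows are positives, other rows in the batch are negatives."""
    def __init__(self, dim: int):
        super().__init__()
        self.g_phi = nn.Linear(dim, dim)    # G_phi

    def _one_source(self, E_fusion: torch.Tensor, E_x: torch.Tensor) -> torch.Tensor:
        pred = F.normalize(self.g_phi(E_fusion), dim=-1)   # normalized prediction
        tgt = F.normalize(E_x, dim=-1)                     # normalized target E_x
        logits = pred @ tgt.t()                            # log Score for all pairs (Eq. 7)
        labels = torch.arange(E_x.size(0), device=E_x.device)
        return F.cross_entropy(logits, labels)             # Eq. (8)

    def forward(self, E_fusion, E_D, E_G):
        # L_CE = L(E_fusion, E_D) + L(E_fusion, E_G)        (Eq. 9)
        return self._one_source(E_fusion, E_D) + self._one_source(E_fusion, E_G)
```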

Multi-tag fusion mechanism

Having enhanced the representation of user preferences from the item perspective, we further enhance it with user-perceived tag information to better capture user preferences. In real life, things that appear more frequently tend to be more popular; similarly, items whose tags appear more frequently tend to have a larger audience, i.e., the more popular a tag is, the more popular the items carrying it are. In our collected data, each item is described by multiple tags, but the tags' popularity differs, so different tags carry different degrees of importance. To emphasize this, we also use keywords extracted from reviews to help reinforce the popularity of tag information, and we design a Multi-Tag Fusion mechanism (MTF) to handle the differing popularity of tags.

The tag embedding representation \({E_\mathcal{T}}\) for all items that have appeared in the dialogue history is obtained in Sect. "Multi-source information acquisition". We use YAKE [15] to extract keywords from the reviews of these items; since we want keywords relevant to the current dialogue, we compute similarity scores between the keywords and the dialogue and select the keyword with the highest score as auxiliary information to reinforce the tags.

$$\begin{aligned} \begin{aligned} \begin{array}{l} K = Top\!-\!1\left( {Y\!A\!K\!E\!\left( R \right) \!,H} \right) \\ {T_p} = Top\!-\!4\left( {{E_\mathcal{T}},{T_{pop}}} \right) \end{array} \end{aligned} \end{aligned}$$
(10)

where \(Top\!-\!1\left( \cdot \right) \) denotes selecting the keyword with the highest similarity score, \(Y\!A\!K\!E\! \left( \cdot \right) \) denotes the YAKE method, K is the selected keyword, \(Top\!-\!4\left( \cdot \right) \) denotes selecting the four most popular tags among all retrieved tags, and \({T_p}\) denotes the resulting tag feature embeddings.
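A sketch of Eq. (10) using the yake package; the similarity function between a keyword and the dialogue is left abstract because the paper does not specify it:

```python
import yake
from typing import Callable, Dict, List

def select_keyword_and_tags(reviews: List[str], dialogue: str,
                            tags: List[str], popularity: Dict[str, int],
                            sim: Callable[[str, str], float]):
    """Top-1: the review keyword most similar to the dialogue.
    Top-4: the four most popular tags among those retrieved."""
    extractor = yake.KeywordExtractor(lan="en", top=20)
    candidates = [kw for kw, _ in extractor.extract_keywords(" ".join(reviews))]
    keyword = max(candidates, key=lambda kw: sim(kw, dialogue))
    top_tags = sorted(tags, key=lambda t: popularity[t], reverse=True)[:4]
    return keyword, top_tags
```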

The obtained keyword K is encoded using a Transformer encoder [34] to obtain \(\mathcal{K}\), which is then fused with the tag features to produce the final tag feature \({E_{T\!fusion}}\), as shown in Eq. (11).

$$\begin{aligned} \begin{aligned} \begin{array}{l} Po{p_i} = \frac{{exp\left( {\frac{{{T_{{p_i}}} \cdot \mathcal{K}}}{{\left\| {{T_{{p_i}}}} \right\| \left\| \mathcal{K} \right\| }}} \right) }}{{\sum \nolimits _{j = 1}^n {exp\left( {\frac{{{T_{{p_j}}} \cdot \mathcal{K}}}{{\left\| {{T_{{p_j}}}} \right\| \left\| \mathcal{K} \right\| }}} \right) } }}\\ {E_{T\!fusion}} = \sum \limits _{i = 1}^n {\left( {Po{p_i} \cdot {T_{{p_i}}}} \right) } \end{array} \end{aligned} \end{aligned}$$
(11)

where \(Po{p_i}\) denotes the final popularity of the i-th tag and \({E_{T\!fusion}}\) denotes the final tag feature embedding obtained by popularity-weighted fusion of the tag feature embeddings \({T_{{p_i}}}\).
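A sketch of Eq. (11), where the keyword encoding \(\mathcal{K}\) and the selected tag embeddings are assumed to share the same dimension:

```python
import torch
import torch.nn.functional as F

def multi_tag_fusion(T_p: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """T_p: (num_tags, d) selected tag embeddings; K: (d,) encoded keyword.
    Cosine similarity with K, softmax-normalized, weighs each tag (Pop_i);
    the weighted sum is the fused tag feature E_Tfusion."""
    cos = F.cosine_similarity(T_p, K.unsqueeze(0), dim=-1)   # (num_tags,)
    pop = F.softmax(cos, dim=0)                              # Pop_i
    return (pop.unsqueeze(-1) * T_p).sum(dim=0)              # E_Tfusion, shape (d,)
```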

Multi-source information collaborative augmented recommendation module

Past research on the recommendation module mainly used items appearing in the dialogue history to represent user preferences. However, because few items can appear in a dialogue history, the item signal is sparse, and preferences captured this way are often inaccurate.

We apply a self-attention mechanism to aggregate the fused item feature embedding \({E_{fusion}}\) and the fused tag feature embedding \({E_{T\!fusion}}\) separately, and then combine the two with a gating mechanism to obtain the user preference embedding, as shown in Eq. (12).

$$\begin{aligned} \begin{aligned} \begin{array}{l} {u^{\left( e \right) }} = S\!A\!\left( {{E_{fusion}}} \right) \\ {u^{\left( t \right) }} = S\!A\!\left( {{E_{T\!fusion}}} \right) \\ {u^{\left( {et} \right) }} = \beta \cdot {u^{\left( e \right) }} + \left( {1 - \beta } \right) \cdot {u^{\left( t \right) }} \end{array} \end{aligned} \end{aligned}$$
(12)

where \(S\!A\left( \cdot \right) \) denotes the self-attention aggregation operation, \({u^{\left( e \right) }},{u^{\left( t \right) }}\) denote the user preference embeddings at the item level and tag level, respectively, and \(\beta \) denotes the gating probability. Finally, based on the user preference embedding \({u^{\left( {et} \right) }}\), we compute the probability \({P_{rec}}\left( j \right) \) of recommending item j to user u from the item set \(\mathcal{I}\).

$$\begin{aligned} \begin{aligned} {P_{rec}}\left( j \right) = softmax\left( {M\!L\!P\!\left( {{u^{\left( {et} \right) }}} \right) } \right) \end{aligned} \end{aligned}$$
(13)

where \(M\!L\!P\!\left( \cdot \right) \) denotes the multi-layer perceptron. Based on existing work, we learn the model parameters with cross-entropy loss and unsupervised contrastive learning loss as optimization objectives.

$$\begin{aligned} \begin{aligned} \begin{array}{l} {\mathcal{L}_C} = - \sum \limits _{i = 1}^N {\sum \limits _{j = 1}^M {{y_{ij}}} } \cdot log\left( {{{P}}_{rec}^i\left( j \right) } \right) \\ \mathcal{L} = {\mathcal{L}_C} + \omega {\mathcal{L}_{CE}} \end{array} \end{aligned} \end{aligned}$$
(14)

where \({\mathcal{L}_C}\) is the cross-entropy loss, \({\mathcal{L}_{CE}}\) is the unsupervised contrastive learning loss, N is the total number of conversations, i is the current conversation index, M is the number of items, and j is the index of an item, \(\mathcal{L}\) is the sum of cross-entropy loss and unsupervised contrastive learning loss. \(\omega \) is the weight of unsupervised contrastive learning loss.
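A sketch of Eqs. (12)-(14); the exact form of the self-attentive pooling SA(·) and of the gate \(\beta \) is not spelled out in the paper, so simple additive attention and a learned scalar gate are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Recommender(nn.Module):
    """Pools sets of fused item and tag features into u^(e), u^(t),
    gates them into u^(et) (Eq. 12), and scores all items (Eq. 13)."""
    def __init__(self, dim: int, num_items: int):
        super().__init__()
        self.attn_e = nn.Linear(dim, 1)     # pooling weights for items
        self.attn_t = nn.Linear(dim, 1)     # pooling weights for tags
        self.gate = nn.Parameter(torch.zeros(1))
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, num_items))

    @staticmethod
    def _pool(X: torch.Tensor, attn: nn.Linear) -> torch.Tensor:
        w = F.softmax(attn(X), dim=0)       # attention over the set
        return (w * X).sum(dim=0)           # SA(·)

    def forward(self, E_fusion: torch.Tensor, E_Tfusion: torch.Tensor) -> torch.Tensor:
        u_e = self._pool(E_fusion, self.attn_e)      # u^(e)
        u_t = self._pool(E_Tfusion, self.attn_t)     # u^(t)
        beta = torch.sigmoid(self.gate)              # gating probability
        u_et = beta * u_e + (1 - beta) * u_t         # Eq. (12)
        return F.softmax(self.mlp(u_et), dim=-1)     # P_rec, Eq. (13)

# training objective (Eq. 14): L = cross_entropy(P_rec, target) + omega * L_CE
```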

Multi-source information collaborative augmented conversation module

The multi-source information collaborative augmented conversation module not only generates chit-chat to explore user preferences but also generates utterances containing recommended items. To improve the descriptiveness and diversity of the system-generated dialogue, tag information is introduced to further describe the recommended items.

We adopt the widely used Transformer [34] as the backbone for response generation, which has shown excellent results in many natural language processing tasks [35,36,37,38]. We first encode the dialogue history H with the Transformer encoder to obtain a hidden representation \(\mathcal{H}\). Then, we employ a Transformer decoder with a multi-head attention mechanism to gradually fuse the context information, tag feature embeddings, and item feature embeddings, as shown in Eq. (15).

$$\begin{aligned} \begin{aligned} \begin{array}{l} A_0^n = M\!H\!A\!\left( {{Y^{n - 1}},{Y^{n - 1}},{Y^{n - 1}}} \right) \\ A_1^n = M\!H\!A\!\left( {A_0^n,\mathcal{H},\mathcal{H}} \right) \\ A_2^n = M\!H\!A\!\left( {A_1^n,{E_{T\!fusion}},{E_{T\!fusion}}} \right) \\ A_3^n = M\!H\!A\!\left( {A_2^n,{E_{fusion}},{E_{fusion}}} \right) \\ {Y^n} = F\!F\!N\!\left( {A_3^n} \right) \end{array} \end{aligned} \end{aligned}$$
(15)

where \({Y^{n - 1}}\) is the decoder output at time step \(n\!-\!1\), \({E_{fusion}},{E_{T\!fusion}}\) are the item and tag feature embeddings from the multi-source information fusion module, and \(A_0^n,A_1^n,A_2^n,A_3^n\) are the outputs of the self-attention layer and the successive multi-information cross-attention layers. \(M\!H\!A\!\left( {Q,\!K,\!V} \right) \) denotes the multi-head attention function [34], computed as in Eq. (16).

$$\begin{aligned} \begin{aligned} \begin{array}{l} M\!H\!A\!\left( {Q,\!K,\!V} \right) = [{h_1};...;{h_h}]{W^o}\\ {h_i} = Attention\left( {QW_i^q,KW_i^k,VW_i^v} \right) \end{array} \end{aligned} \end{aligned}$$
(16)

where the query matrix Q, key matrix K, and value matrix V are the inputs, h is the number of heads, and \({W_i}\) are parameter matrices. \(F\!F\!N\!\left( \cdot \right) \) is a fully connected feed-forward network consisting of two linear layers with a ReLU activation function.

$$\begin{aligned} \begin{aligned} FFN\left( x \right) = ReLU\left( {x{W_3} + {b_3}} \right) {W_4} + {b_4} \end{aligned} \end{aligned}$$
(17)

where \({W_3},{W_4}\) are learnable parameters and \({b_3},{b_4}\) are two biases.
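A sketch of one decoder layer implementing Eqs. (15)-(17) with PyTorch's nn.MultiheadAttention; residual connections and layer normalization, which a standard Transformer layer would include, are omitted for brevity:

```python
import torch
import torch.nn as nn

class MultiSourceDecoderLayer(nn.Module):
    """Self-attention over the generated prefix Y, then successive
    cross-attention over the dialogue encoding H, the fused tag features,
    and the fused item features, followed by the FFN of Eq. (17)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ctx_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.tag_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.item_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, Y, H, E_Tfusion, E_fusion):
        A0, _ = self.self_attn(Y, Y, Y)                     # A_0^n
        A1, _ = self.ctx_attn(A0, H, H)                     # A_1^n
        A2, _ = self.tag_attn(A1, E_Tfusion, E_Tfusion)     # A_2^n
        A3, _ = self.item_attn(A2, E_fusion, E_fusion)      # A_3^n
        return self.ffn(A3)                                 # Y^n
```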

To generate dialogue utterances, the decoder output \({Y_n}\) is passed through a softmax operation to predict the word distribution. The CRS should incorporate content from the recommendation module when generating the response, as well as provide a general description of the recommended items; the copy mechanism [39] achieves both purposes and increases the informativeness and richness of the response. Given a generated sequence \(\left\{ {{y_{i - 1}}} \right\} = {y_1},{y_2},...,{y_{i - 1}}\), the probability of generating the next token \({y_i}\) is shown in Eq. (18).

$$\begin{aligned} \begin{aligned} \begin{array}{l} Pr\left( {{y_i}|\{ {y_{i - 1}}\} } \right) = P{r_1}\left( {{y_i}|{Y_i}} \right) + P{r_2}\left( {{y_i}|{Y_i},{E_{T\!fusion}}} \right) \\ \quad \quad \quad \quad \quad \quad \quad \quad + P{r_3}\left( {{y_i}|{Y_i},{E_{fusion}}} \right) \end{array} \end{aligned} \end{aligned}$$
(18)

where \({Y_i}\) is the decoder output, \(P{r_1}\left( \cdot \right) \) is the probability of generating an ordinary word from the vocabulary, and \(P{r_2}\left( \cdot \right) ,P{r_3}\left( \cdot \right) \) are the probabilities of copying words from the tags and entities, respectively. We use a cross-entropy loss to optimize response generation in the conversation module.

$$\begin{aligned} \begin{aligned} {\mathcal{L}_{gen}} = - \frac{1}{T}\sum \limits _{i = 1}^T {log} Pr\left( {{y_i}|\{ {y_{i - 1}}\} } \right) \end{aligned} \end{aligned}$$
(19)

where T is the number of tokens in the generated response.
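A sketch of this copy-augmented objective: the three probability heads of Eq. (18) are assumed to be pre-aligned over a shared extended vocabulary, and Eq. (19) averages the negative log-likelihood of the gold tokens:

```python
import torch

def copy_distribution(pr_vocab: torch.Tensor, pr_tag: torch.Tensor,
                      pr_item: torch.Tensor) -> torch.Tensor:
    """Eq. (18): combine the generation head with the tag and item copy
    heads, all shaped (seq_len, vocab_size) over a shared vocabulary."""
    return pr_vocab + pr_tag + pr_item

def generation_loss(probs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Eq. (19): mean negative log-likelihood of the gold tokens."""
    gold = probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return -torch.log(gold + 1e-12).mean()
```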

Experiments

In this section, we will introduce the datasets, the baseline model, the evaluation metrics, and the implementation details.

Datasets

In the CRS domain, two English conversation datasets, REDIAL [40] and INSPIRED [41], are widely used.

The REDIAL dataset was constructed via Amazon Mechanical Turk (AMT) with conversations centered on movies, in which one party seeks recommendations (the user) and the other provides them (the system). It contains 10,006 dialogues with 182,150 sentences in total, involving 956 users and 51,699 movies.

The INSPIRED dataset is also an English dataset about movies, but it is a smaller dataset, containing 1,001 dialogues with a total of 35,811 sentences about 1,783 movies. Table 1 summarizes the statistics of the two datasets after processing.

Table 1 The REDIAL dataset vs. the INSPIRED dataset

Baseline model

We compared our approach to several competitive baseline models, including:

  • Redial [40]: The baseline model released with the REDIAL dataset; it consists of an autoencoder-based recommendation module and an HRED-based response generation module.

  • KBRD [1]: The model utilizes DBpedia to augment the representation of items in context, then uses the Transformer architecture and incorporates information from the Knowledge Graph as a word bias for the response statements.

  • KGSF [13]: The model utilizes two knowledge graphs, one at the word level and the other at the item level, and aligns their semantic spaces via mutual information maximization.

  • RevCore [14]: The model enhances the representation of items in context by introducing review information, which also makes the generated responses more diverse.

  • \({\textbf{C}}^{\textbf{2}}\)-CRS [16]: The model uses contrastive learning on data signals at different granularities to better fuse user preferences.

  • LOT-CRS [42]: The model addresses the long-tail problem in CRS datasets and thereby improves recommendation performance.

All of the above baselines (Redial, KBRD, KGSF, RevCore, \(\mathrm {C^2}\)-CRS, and LOT-CRS) are conversational recommender system models.

Evaluation metrics

When evaluating the recommendation module and the conversation module, we need to use different evaluation metrics.

Table 2 Results of the recommendation task. We simplify Recall@k as R@k and MRR@k as M@k, respectively

For the recommendation module, we aim to verify that our model accurately captures user preferences and provides high-quality recommendations. Therefore, we evaluate the recommendation task using Recall@k and MRR@k (k=1,10,50), which are used to evaluate whether the top-k recommended items generated by the model contain the real item labels provided by the dataset.

For the conversation module, we employed two evaluation methods: automatic and human evaluation. In automatic evaluation, we used Perplexity [43] and Distinct-n (n=2,3,4) [44] to evaluate dialogue quality. Perplexity measures the fluency of response generation, with lower values implying more fluent sentences. Distinct-n measures the diversity of generated responses, where n is the n-gram size considered within a sentence; a higher Distinct value indicates more diverse responses. In human evaluation, evaluators score both sentence fluency and informativeness on a scale of [0, 2], and the average of the evaluators' scores is used as the human evaluation result to assess dialogue quality more comprehensively.
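For reference, sketches of the three list-based automatic metrics under their common definitions (the paper does not give formulas):

```python
from typing import List, Set

def recall_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """Fraction of ground-truth items retrieved in the top-k list."""
    return len(set(ranked[:k]) & gold) / max(len(gold), 1)

def mrr_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """Reciprocal rank of the first ground-truth item within the top k."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in gold:
            return 1.0 / rank
    return 0.0

def distinct_n(responses: List[List[str]], n: int) -> float:
    """Unique n-grams divided by total n-grams across all responses."""
    grams = [tuple(r[i:i + n]) for r in responses for i in range(len(r) - n + 1)]
    return len(set(grams)) / max(len(grams), 1)
```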

Implementation details

We implemented our MCCA model with PyTorch and trained it on an NVIDIA GeForce RTX 3090. We set the maximum length of the dialogue context to 256, and the hidden dimensions of the recommendation and dialogue modules to 128 and 300, respectively. To balance efficacy and efficiency, we employed two one-layer R-GCNs to encode the two knowledge graphs, with the R-GCN normalization factor set to the default value of 1.0. The coefficient \(\omega \) of the unsupervised contrastive learning loss is 0.05. We first pre-train for 3 epochs and then train the recommendation module and the conversation module separately. During training, we use the Adam optimizer [45] with a batch size of 128 and an initial learning rate of 0.001, and we employ a gradient clipping strategy that limits gradients to the range [0, 0.1]. For fair comparison and optimal performance, the baselines' hyperparameter settings follow their respective implementations.
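The stated optimization settings correspond to a training loop like the following sketch; the model interface (returning the two loss terms of Eq. 14) is our assumption:

```python
import torch

def train_recommendation(model: torch.nn.Module, loader, omega: float = 0.05):
    """Adam with lr 1e-3, batch size 128 (set in the loader), and gradient
    clipping at 0.1, optimizing L = L_C + omega * L_CE from Eq. (14)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for batch in loader:
        l_c, l_ce = model(batch)       # cross-entropy and contrastive losses
        loss = l_c + omega * l_ce
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
```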

Results and analysis

In this section, we verify the validity of our model through experiments and analyze a representative case.

Evaluation of recommendation module

Results analysis

Table 2 shows the results of the different methods on the recommendation task. KBRD introduces a knowledge graph to enhance item representation; KGSF improves significantly on KBRD by applying mutual information maximization to its item-level and word-level semantic spaces; RevCore enhances item representation by introducing reviews; \(\mathrm {C^2}\)-CRS improves results by better integrating user preferences through coarse-to-fine contrastive learning; and LOT-CRS addresses the long-tail problem of the dataset and performs better than the other baseline models. The gains brought by these approaches are progressively larger; in terms of performance, the order is ReDial < KBRD < KGSF < RevCore < \(\mathrm {C^2}\)-CRS < LOT-CRS.

The results in Table 2 show that our model outperforms all baseline models. In terms of introducing external knowledge, we construct a new movie knowledge graph, MTKG, introducing both item and tag information; moreover, our knowledge graph covers only the movie domain, reducing noise interference from other domains. In terms of enhancing user preference representation, we fully mine the latent tag preferences in the dialogue and design item-level and tag-level information fusion mechanisms, while using unsupervised contrastive learning to fully mine inter-item connections and improve CRS performance. Compared with LOT-CRS, our model improves R@1 by 14.2% and R@10 by 19.9% on the REDIAL dataset, with significant improvements in the MRR metrics as well. On the INSPIRED dataset, R@1 improves by 10.6% and R@10 by 4.8%; the improvement is less pronounced than on REDIAL, possibly due to INSPIRED's smaller size. Overall, our model shows a noticeable improvement on both datasets, indicating that the newly constructed knowledge graph and the multi-source information fusion mechanisms effectively help the model capture user preferences accurately.

Ablation study

In the recommendation part, we obtain embedding representations of items and tags from the high-quality movie knowledge graph MTKG we constructed. We fuse the MTKG item embeddings with those extracted from DBpedia and perform unsupervised contrastive learning between the fused item embeddings and the embeddings from each knowledge graph. Additionally, we perform multi-tag fusion on tag embeddings to integrate user preferences from multiple sources, enhancing the model's recommendation performance.

To validate the effectiveness of our approach, we conducted a series of ablation experiments focused on the newly added components: (1) removing the unsupervised contrastive learning module (w/o UCL); (2) removing the multi-source item fusion mechanism (w/o MIF); (3) removing the multi-tag fusion mechanism (w/o MTF); (4) removing tag embeddings (w/o Tag); (5) removing MTKG item embeddings (w/o M entity), i.e., using only item representations extracted from DBpedia; and (6) removing DBpedia item embeddings (w/o D entity), i.e., using only item representations extracted from MTKG.

Fig. 3 Results of ablation experiments for Recall@10 and MRR@10 on the REDIAL dataset for the recommendation task

Table 3 The results of the automatic evaluation of the conversation task. We simplify Distinct-n as Dist-n and Perplexity as PPL

Based on the ablation results in Fig. 3, we observed that each component plays an important role in improving recommendation accuracy. In particular, the performance of "w/o UCL" and "w/o MIF" drops sharply across all metrics, indicating that fusing knowledge graphs from different sources is essential: they contain rich item information separated by information gaps, and bridging those gaps allows a more effective representation of user preferences. In addition, the "w/o MTF" metrics decrease, suggesting that tag popularity helps the system enhance item representations and make more accurate recommendations. The sharper drop of "w/o M entity" compared with "w/o D entity" indicates that MTKG contains less noise, and reducing noise also improves the model.

Evaluation of conversation module

Results analysis

We evaluated the conversation task using both automatic and human evaluation. Tables 3 and 4 show the results of the different evaluation methods on the conversation task, with the best results highlighted in bold. Under automatic evaluation, the KBRD and KGSF models perform better, indicating that external knowledge and semantic similarity information contribute to generating better responses. The RevCore model further improves the conversation module by introducing reviews; \(\mathrm {C^2}\)-CRS better incorporates user preferences through multi-granularity contrastive learning; and LOT-CRS addresses the long-tail problem in the CRS dataset, further improving performance.

Our model introduces tag information through the MTKG and employs unsupervised contrastive learning to further enhance item representations and improve the descriptiveness and diversity of system-generated responses. In automatic evaluation on the REDIAL dataset, our model achieves a significant improvement over the baseline LOT-CRS, with Dist-2 and Dist-3 improving by 10.0% and 8.8%, respectively; on the INSPIRED dataset, Dist-2 and Dist-3 improve by 8.9% and 7.0%, respectively. The results on both datasets show that our introduction of tag information is effective. In human evaluation, the results in Table 4 show that our model also performs best in terms of response fluency and informativeness.

Table 4 Results of human evaluation for the conversation task.

Ablation study

Table 5 Ablation experiments on the REDIAL dataset regarding the conversation task

To verify the effectiveness of our model in the conversation task, we conducted ablation experiments, shown in Table 5. "w/o F tag" means removing the fused tag embeddings; "w/o F entity" means using non-fused item embeddings, where the item embeddings are merely concatenated. The "w/o F tag" results show that movie tags enhance the diversity of system responses, and the "w/o F entity" results demonstrate that fusing items from various sources improves the system's generated responses.

Hyperparameter study

Fig. 4 Parameter analysis on the REDIAL dataset

Effect of number of tags

In our approach, each movie corresponds to multiple tags, so the number of tags obtained varies. To explore the effect of the number of tags per movie on model performance, we conducted a series of experiments, setting the number of tags per movie to 1 through 6, as shown in Fig. 4a. Our model performs best on the recommendation task with 4 tags per movie. Rich tag information thus improves the model's accuracy in capturing user preferences, but too many tags (e.g., 5 or 6) introduce noise that hampers preference capture and reduces recommendation performance.

Effect of weights

For the weight \(\omega \) of the unsupervised contrastive learning loss in the recommendation task, we analyzed Recall@10 on the REDIAL dataset with the other parameters fixed; the results are shown in Fig. 4b. The model performs best when \(\omega \) is set to 0.1. This means that making full use of information from different sources indeed leads to a more accurate representation of user preferences, and an appropriate weight in the loss function lets the model better exploit multi-source information and capture user preferences.

Effect of the number of R-GCN layers

Two R-GCNs are used in the MCCA model to encode the two knowledge graphs separately, and we investigated the effect of the number of R-GCN layers on model performance. Figure 4c shows Recall@10 on the REDIAL dataset for the recommendation task. The model performs best with one R-GCN layer: external knowledge helps the model understand user preferences, but more layers introduce too much noise, degrading performance.

Case studies

In this section, we illustrate how MCCA works with a case, shown in Fig. 5. First, the model extracts the item "Inception" from the context. Then we retrieve the movie's tags and select the top four popular ones, such as "science fiction action films", and look up the embeddings of the item and its tags in the knowledge graph MTKG, as well as the item's embedding in the knowledge graph DBpedia. The two item embeddings are fused into a single item embedding, whose features are fully mined by unsupervised contrastive learning against the embeddings from the two knowledge graphs. For the tag embedding, we use the unsupervised YAKE method to extract keywords from the reviews as an aid to multi-tag fusion, producing the final tag embedding. The recommendation module fuses the item and tag embeddings into a user preference representation and predicts movies the user might be interested in, such as "Interstellar". The conversation module uses the dialogue context, tags, entities, and the recommendation module's predictions to generate responses. If the user dislikes or has already seen the recommended movie, the system continues interacting and dynamically updates the user's preferences; in Fig. 5, the user has already seen "Interstellar" and asks for other recommendations, so the system re-recommends based on the dialogue history.

Fig. 5 Case study. Pink font denotes movie items and yellow font denotes descriptive words from tag information

Conclusion

We propose a Multi-source Information Contrastive Learning Collaborative Augmented approach (MCCA) to improve conversational recommender systems, drawing on a source of information that reflects user preferences: tags. By constructing a Movie-Tag Knowledge Graph (MTKG) with tag information, we strengthen the tag-based connections between items. We design a multi-source item fusion mechanism to bridge the information gap between the different knowledge graphs and obtain fused item features, which unsupervised contrastive learning then optimizes, fully exploiting the connections across the graphs. We further design a multi-tag fusion mechanism that uses keywords to reinforce tag popularity and produce the final tag features. Finally, the acquired tag and item preferences yield a better representation of user preferences and improve the quality of system responses. Extensive experiments show that our approach sufficiently bridges the information gap between multi-source information in CRS and fully utilizes popularity information.

The extensive introduction of external knowledge in existing studies creates a noise problem that can bias the captured user preferences. In future work, we intend to explore more effective methods to address this bias.