1 Introduction

In recent years, data mining on heterogeneous information networks has attracted extensive attention from both industry and academia. For example, in an academic social network, users can search for the most similar author or the most relevant paper based on the queries they input [1]. Similarity search is a popular data mining task that aims to find the objects most similar to a given object (query statement, node, etc.). Classical similarity search studies, including PageRank [2], HITS [3], and SimRank [4], focus on homogeneous information networks or binary networks, which have only one type of edge. However, real-world data are complicated, and a homogeneous information network cannot fully express their rich semantic information. Therefore, the heterogeneous information network (HIN) has become a popular choice for describing real data. Recently, a number of search and recommendation techniques for HINs have been proposed [5,6,7,8,9,10,11,12,13,14,15], which aim to find similar objects in HINs based on given information.

Specifically, the main goal of the top-k similarity search task in HINs we will study is to obtain a set of nodes related to a given node according to the network characteristics. At present, predefined metapaths that can capture structural information are widely used in HIN-based research. Unlike in homogeneous information networks, different metapaths in HINs are specified to represent different semantics. Therefore, it is better to flexibly adjust different metapaths for different queries. Since nodes have both structural information and content information, how to integrate them to comprehensively represent nodes is also a great challenge that needs to be solved.

There are some unsupervised methods [6, 16] that consider the content and structure information of nodes for similarity search in HINs. However, these methods utilize a fixed mode to combine them. In fact, for nodes with few interactions, capturing their content information is more important. Meanwhile, for active nodes, the main task is to train their structure information in the given HIN for similarity search. Therefore, it is necessary to flexibly consider the content information and structure information of nodes.

Machine learning-based methods utilize node embedding in HINs for similarity search, that is, they measure the similarity of nodes according to the similarity of their vectors. Metapath2vec [17] and some other methods [5, 18,19,20,21,22,23,24] only represent nodes based on one metapath, which is useful for tasks with specific semantic requirements. For recommendation tasks, MCRec [11] uses an attention mechanism to combine multiple metapath information to realize similarity measurement between users and items, but it does not consider the content information of nodes. HetGNN [16], which considers the content and structure information of nodes, uses the attention mechanism for embedding, but the content and structure information are not comprehensively combined. Specifically, when the model is trained by the nodes’ structure information, the content features fade or even disappear. In addition, none of the above methods considers the time factor contained in the network. Therefore, it is possible that two nodes with completely opposite changing trends are considered to be the most similar. Finally, some data distribution information in the network also affects the characteristics of nodes.

To achieve the similarity search task in HINs and to overcome the above challenges (adjusting different metapaths for different queries, flexibly integrating structure information and content information to comprehensively represent nodes, and considering the time factor and other data distribution information), we propose a model for top-k similarity search based on a double channel convolutional neural network in weighted HINs (called SSDCC+). SSDCC+ uses a CNN with two channels to train the content information and structure information at the same time. We generate different structure and content embeddings for each node according to different metapaths, which has not been done in previous studies, and extract the internal attribute information of the node. We sample path instances for given metapaths and calculate their weights. Because the importance of metapaths is not the same for each node, our model distinguishes the metapaths for each node with an attention mechanism. Another attention mechanism, which considers the outputs of the two channels and the attribute information of the data, is proposed to combine the node’s content and structure information for its comprehensive representation. Finally, an evaluation function is designed using the time information contained in the objects and the data distribution information to evaluate the importance of the content and structure information.

To summarize, the major contributions of our work are as follows:

  • To the best of our knowledge, this is the first attempt to apply a double channel convolutional neural network to HINs. We use it to train the structure information and content information of nodes simultaneously for similarity search, which avoids the problem that the content information embedded in earlier training decays during the subsequent structure training.

  • Two different attention mechanisms are designed in SSDCC+ to fully realize the personalization of the constructed model. First, an attention mechanism is used for multiple metapaths when embedding structure information, which flexibly and personally implements comprehensive consideration of structural information under different metapaths for each target node. Second, another attention mechanism is used between the content representation and the structure representation to combine them.

  • We design a content and structure importance evaluation function based on the attribute information of the dataset to fuse the output results of two channels and propose a dual channel fusion module under the influence of attributes to improve the accuracy of search results and make the model more explainable.

  • We have implemented our model on a public dataset to evaluate its performance and compared it with existing methods. The results show that the model handles similarity search on HINs well and achieves higher performance than the existing methods.

The rest of this paper is organized as follows. Section 2 describes related works involved in this research. Section 3 introduces some definitions used in this paper and states our problem. Section 4 implements a comprehensive representation of multiple path instances between nodes. Section 5 presents a double channel-based search model to train the content and structure information of nodes. Section 6 designs an evaluation function and fuses two channels under the influence of attributes to improve the model. Section 7 conducts experiments and analyzes the results. Section 8 summarizes the conclusions and the contributions of this paper.

2 Related work

In this section, we briefly review work in three areas that are closely related to ours: network embedding on HINs, research on double channel convolutional neural networks, and the top-k similarity search task on HINs.

2.1 Network embedding

In recent years, with the development of HINs [25], a large number of network embedding methods for HINs have emerged. Utilizing node embedding on HIN, the nodes’ representations are turned from high-dimensional sparse vectors to low-dimensional vectors while the relationship information between nodes is preserved. Some researchers use this kind of node vector to measure the similarity of nodes directly.

The deep learning-based method is currently a mainstream node embedding method [26,27,28,29,30]. In homogeneous information networks, DeepWalk [18] and TriDNR [20] implement node embedding based on random walks, NAIE [31] designs an integrated autoencoder to learn a more informative representation of nodes by integrating structure and attribute information, and many studies use GNNs [32]. In HINs, the main research includes metapath2vec [17], HIN2vec [21], HAN [22], H2Rec [33], etc. Metapath2vec represents nodes based on random walks, and HIN2vec uses a neural network to learn the representation of nodes and metapaths. However, both of them only consider a single metapath; thus, they cannot capture rich path semantic information of nodes in HINs. HAN adopts a node-level attention mechanism and a semantic-level attention mechanism, and uses the latter to distinguish metapaths to obtain semantic information. However, its parameters are shared for all metapaths, resulting in the same metapath attention scores for different nodes. H2Rec considers both homogeneous and heterogeneous information networks, but fuses their embedded information based on a simple sigmoid function.

The similarity search method based on the above models cannot flexibly adjust the search according to the network state, which means that a more appropriate model is needed.

2.2 Double channel convolutional neural networks

Double channel convolutional neural networks refer to a two-stream architecture that trains data in two streams simultaneously and fuses them. This architecture has been widely studied and used in machine learning tasks related to image processing, and was first proposed in [34] for action recognition in videos. The work in [35] applied double channel convolutional neural networks to an air quality measurement model, using each channel to train different parts of the environment images for feature extraction. MDCC-Net [36] designs a multiscale double channel convolution feature fusion module, in which the double channel convolutional neural network is used to combine features of different scales in the same encoding layer. This fully extracts the detailed information of the original image and learns more tumor boundary information. In other studies, double channel convolutional neural networks have also been used for polarimetric SAR image classification [37], image smoke detection [38], image reconstruction [39], power edge image recognition [40], hand-drawn sketch recognition [41], hyperspectral image classification [42], etc.

The various double channel convolutional neural networks designed in the above studies are mainly used to extract image features from different aspects. In this work, we attempt to use two channels to simultaneously capture the content information and structure information of nodes in the network, so existing models cannot be directly applied. We need to design a new framework suitable for the similarity search task on weighted HINs.

2.3 Top-k similarity search

There have already been many works to measure the similarity of objects in HINs. PathSim [12] uses metapaths and defines the similarity between two objects of the same type by considering the accessibility and visibility between vertices to solve the similarity search problem. HeteSim [23] is an extension of PathSim, which can measure the similarity of different types of objects. W-PathSim [13] proposes another improvement of PathSim by using the weighted cosine similarity of topics. RoleSim [43] considers both structural features and Jaccard similarity of attribute features. TopCPathSim [44] is proposed to address the problem related to topic-driven similarity search based on a constrained metapath. NeuPath [45] transforms PathSim computation into a deep learning model, which relieves the limitation of traditional PathSim when computing on large graphs. SimRank [4] is a classical similarity search method on a homogeneous information network, and it assumes that two objects are similar if they are referenced by similar objects. Carmo [46] addresses the problem of set query with SimRank and applies it to personalized link prediction and event recommendation. CrashSim [47] enables computation of SimRank in both static networks and temporal networks. Similarity search is also applied to recommendation tasks and holistic influence maximization tasks [48]. For example, MCRec [11] uses an attention mechanism to measure the similarities between users and items. However, different users and items will jointly affect the weight of metapaths, which makes the model inefficient. In fact, calculating the weight according to the user information alone is sufficient to provide recommendations.

The above approaches only consider a single aspect, such as content or structure similarity in HINs, and some only consider one metapath when measuring the structure similarity.

Some methods [6, 16] considering content and structure information usually embed the content information first and then embed the structure with the content embedding result as the initial input. However, similar to an optimization model seeking the minimum value, the initial state usually has little effect on the final result. In other words, as the model is trained by the nodes’ structure information, the content features contained in the initial representation will fade or even disappear. In addition, the above methods do not consider the time information in the network and treat the interaction at different times equally.

3 Preliminaries

In this section, we briefly introduce the preliminaries that will be used later and formulate the problem.

Definition 1

(Heterogeneous Information Network (HIN)) Given an information network, it can be represented by a directed graph \(G=(V,E,A,{\mathcal {R}},Al,\phi ,\varphi )\), where V is the object set, E is the link set, A is a collection of object types, \({\mathcal {R}}\) is the set of link types, and Al is the attribute set corresponding to different types of links in \({\mathcal {R}}\). In addition, \(\phi :V \rightarrow A\) is a mapping function from the object set to its type set, where each object \((v \in V)\) belongs to a specific object type \((\phi (v) \in A)\). \(\varphi :E\rightarrow {\mathcal {R}}\) is a mapping function from the link set to the link-type set, where each link \((e\in E)\) belongs to a specific link type \((\varphi (e)\in {\mathcal {R}})\). When the number of object types \(\vert A\vert >1\) (or the number of link types \(\vert {\mathcal {R}}\vert >1\)), this network is a heterogeneous information network (HIN).

In an HIN, if the attribute values are not all equal to 0, that is, \(\exists x, al^i_x \ne 0\) (\(al^i_x\) is the value of the attribute corresponding to edge x, \(al^i_x\in Al\)), the network is a weighted heterogeneous information network (weighted HIN).

Example 1

Considering an HIN on bibliographic data in Fig. 1, there are three types of nodes: authors, papers and conferences. Four types of links exist: writing and written-by between authors and papers, publishing and published-in between papers and the venues. The attribute position corresponding to writing and written-by represents the position of the author in the author list of the paper, and the attribute time corresponding to publishing and published-in stands for the year in which the paper was published in this venue.

Fig. 1 Example of HIN on bibliographic data
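
As an illustration of Definition 1 and Example 1, the following minimal Python sketch stores a toy weighted HIN on bibliographic data. The class name `WeightedHIN`, its methods, and the node names and attribute values are hypothetical choices made for this sketch; they are not part of SSDCC+.

```python
from collections import defaultdict


class WeightedHIN:
    """Toy weighted HIN: typed nodes and typed directed links with one attribute value each."""

    def __init__(self):
        self.node_type = {}                 # phi: node -> object type
        self.out_edges = defaultdict(list)  # node -> [(neighbour, link type, attribute value)]

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_link(self, src, dst, ltype, reverse_ltype, attr):
        # Store a relation together with its inverse (e.g. writing / written-by).
        self.out_edges[src].append((dst, ltype, attr))
        self.out_edges[dst].append((src, reverse_ltype, attr))

    def object_types(self):
        return set(self.node_type.values())

    def link_types(self):
        return {lt for edges in self.out_edges.values() for _, lt, _ in edges}

    def is_heterogeneous(self):
        # More than one object type or more than one link type.
        return len(self.object_types()) > 1 or len(self.link_types()) > 1

    def is_weighted(self):
        # At least one link attribute value differs from 0.
        return any(attr != 0 for edges in self.out_edges.values() for _, _, attr in edges)


hin = WeightedHIN()
for name, ntype in [("a1", "author"), ("a2", "author"), ("p1", "paper"), ("c1", "conference")]:
    hin.add_node(name, ntype)
hin.add_link("a1", "p1", "writing", "written-by", 1)         # a1 is first author of p1
hin.add_link("a2", "p1", "writing", "written-by", 2)         # a2 is second author of p1
hin.add_link("p1", "c1", "published-in", "publishing", 2020)  # publication year

print(hin.is_heterogeneous(), hin.is_weighted())
```

Here the position and time attributes are attached directly to the links, which mirrors the role of the attribute set Al in Definition 1.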

We also use a metapath to express the rich path semantic information in an HIN.

Definition 2

(Metapath) A metapath expressed as \(A_0{\mathop {\longrightarrow }\limits ^{{\mathcal {R}}_0}}A_1{\mathop {\longrightarrow }\limits ^{{\mathcal {R}}_1}}\cdots {\mathop {\longrightarrow }\limits ^{{\mathcal {R}}_{l-1}}}A_l\) is a path in an HIN, where \(A_0,A_1,\cdots ,A_l\in A\) and \({\mathcal {R}}_0,{\mathcal {R}}_1,\cdots ,{\mathcal {R}}_{l-1}\in {\mathcal {R}}\). It defines a combination relationship \({\mathcal {R}}={\mathcal {R}}_0 \circ {\mathcal {R}}_1 \circ \cdots \circ {\mathcal {R}}_{l-1}\) between object types \(A_0\) and \(A_l\). For simplicity, we directly represent a metapath as \(P=(A_0,A_1,\cdots ,A_l )\). We call a specific path \(p=(a_0,a_1,\cdots ,a_l\vert w_p)\) of a given metapath P a path instance, where \(w_p\) is the weight of path instance p, and it can be calculated according to the attribute \(al \in Al\) of links \(\{(a_0,a_1),\) \((a_1,a_2),\) \(\cdots ,\) \((a_{l-1},a_l)\}\) on the path instance p.
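
To make the relation between a metapath and its path instances concrete, the sketch below enumerates all path instances of a metapath \(P=(A_0,\cdots ,A_l)\) that start at a given node. The function `path_instances`, the adjacency dictionary and the toy node names are assumptions made for illustration; the path weight \(w_p\) is left out because its exact formula is not fixed by Definition 2.

```python
def path_instances(adj, node_type, start, metapath):
    """Enumerate all path instances of `metapath` (a tuple of object types) starting at `start`."""
    if node_type[start] != metapath[0]:
        return []
    paths = [[start]]
    for wanted_type in metapath[1:]:
        # Extend every partial path by each neighbour that has the required object type.
        paths = [p + [nxt]
                 for p in paths
                 for nxt in adj.get(p[-1], ())
                 if node_type[nxt] == wanted_type]
    return [tuple(p) for p in paths]


# Toy bibliographic network: A = author, P = paper, C = conference.
node_type = {"a1": "A", "a2": "A", "a3": "A", "p1": "P", "p2": "P", "c1": "C"}
adj = {
    "a1": ["p1", "p2"], "a2": ["p1"], "a3": ["p2"],
    "p1": ["a1", "a2", "c1"], "p2": ["a1", "a3", "c1"],
    "c1": ["p1", "p2"],
}

apa = path_instances(adj, node_type, "a1", ("A", "P", "A"))
apcpa = path_instances(adj, node_type, "a1", ("A", "P", "C", "P", "A"))
print(len(apa), len(apcpa))
```

In this toy network the two metapaths express different semantics: APA links co-authors, while APCPA links authors who published in the same conference.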

Definition 3

(Top-k similarity search in HIN) Given an HIN \(G=(V,E,A,{\mathcal {R}},Al,\phi ,\varphi )\), its metapath set \(\{ P_1, P_2,\cdots ,P_x\}\), a query node \(n_q\) \((n_q\in V)\), and the number of results k, top-k similarity search aims to find the k nodes most similar to \(n_q\) in G based on a similarity measure F. How to define a proper F is studied in this paper.

4 Capturing semantic information between nodes

The purpose of our research is to find the k nodes that are most similar to a query node in a weighted HIN, so we need to calculate how similar the query node is to all other nodes in the network. In this section, we will implement a comprehensive representation of multiple path instances between two nodes, thereby capturing semantic information between nodes. First, we use the pretraining method to generate content and structure representations of nodes and extract their attribute information. Then, we generate inputs by sampling the path instances starting from the target node, which is the object to be queried, and design the neural network model to obtain the representation of the path instances. Finally, we propose a comprehensive path representation model with an attention mechanism between metapaths. Furthermore, to capture the rich information of HINs, we add an on-path attribute capture module during the path instance sampling process.

4.1 Node representation and attribute extraction

Node information in HINs can be divided into two parts, including content information and structure information, as shown in Fig. 2. The content information refers to text descriptions and other attribute information, such as images and labels, while the structure information refers to the connection relationship between nodes on the network.

First, we extract the content information of nodes. Specifically, it includes capturing nodes’ attribute information and representing text descriptions. For example, Fig. 2a shows the content information of a paper-type node. The abstract can be used for content embedding, and others can be used to extract attribute information. We capture the attribute information \(An=\{an_1 , an_2, \cdots , an_l\}\) of each node and use it in Sect. 4.2.3. Considering a node a, we use Doc2vec [24] to generate its text representation \({\varvec{a}}^c \in {\mathbb {R}}^{\vert A\vert \times d_1}\), where \(\vert A\vert\) is the number of nodes associated with type A and \(d_1\) is the nodes’ content embedding dimension.

Second, we use another embedding method to represent nodes’ structure information. Many models have been proposed to embed nodes in HINs through metapaths because of the rich semantics they involve. In this research, we use metapath2vec++ [17] to generate nodes’ structure representation \({\varvec{a}}^s \in {\mathbb {R}}^{r \times \vert A\vert \times d_2}\), where r is the number of metapaths, and \(d_2\) is the nodes’ structure embedding dimension. Moreover, metapath2vec++ trains nodes based on a single given metapath. Therefore, for each metapath \(P_i\), we can obtain the nodes’ structure representation \({\varvec{a}}_i^s\), where \(i \in \{1,\cdots ,n\}\) denotes the given metapath id.

Fig. 2 Node representation and attribute information extraction

4.2 Path representation

The goal of path representation is to capture the rich semantic information between two nodes in a personalized and flexible way and to obtain the path structure representation and the path content representation. Its framework is shown in Fig. 3.

Fig. 3 Path representation

4.2.1 Integrated embedding of nodes

In Sect. 4.1, the model generates a series of structure embeddings \({\varvec{a}}_i^s\) corresponding to r given metapaths, and it also generates one content embedding \({\varvec{a}}^c\) for each node. To map the vectors of nodes to a new space, we design three dense layers Dense 0, Dense 1 and Dense 2. We obtain the integrated structure embedding \({\varvec{a}}^s\) of nodes by inputting \({\varvec{a}}_i^s\) into Dense 1 and obtain the integrated embedding \({\varvec{a}}^{s+c}\) of nodes including both structure and content by inputting \(\varvec{a}_i^s\) and \({\varvec{a}}^c\) into Dense 0. We also input \({\varvec{a}}^c\) into Dense 2 to maintain consistency.

4.2.2 Path instance and metapath representation

Many methods calculate nodes’ similarity in HINs directly from the similarity of their embedding vectors. Inspired by the recommendation model MCRec [11], which uses path instances as input to capture the nodes’ interaction, we represent metapaths to capture rich path semantic information between nodes in similarity search tasks. As shown in Fig. 3, for the target node \(a_1\), we take it as the starting point to sample path instances according to the designed metapath. Under the metapath \(P_i\) \((i\in \{1,\cdots ,n\})\), we obtain multiple path instances \(p_{i.j}\) (\(j\in \{1,\cdots ,m\}\) is the id of a path instance) from \(a_1\) to another node \(a_2\) in the weighted HIN and calculate the path weight according to the attributes on the edges between adjacent nodes. In SSDCC+, we use a weight-based random walk to obtain path instances and record their weights. Specifically, in each time unit, the walker takes a step to one of the neighbours of the current node according to the transition probability, which comprehensively considers multiple attribute factors between nodes, such as time information. In addition, to reduce the training time, we discard some path instances with low weight.
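
The sampling procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the transition probability is taken to be proportional to a single edge weight, the path weight is taken to be the product of the transition probabilities along the walk, and the names `sample_walk`, `sample_path_instances` and `min_weight` are hypothetical. The actual transition probability in SSDCC+ combines several attribute factors and is not reproduced here.

```python
import random


def sample_walk(adj, node_type, start, metapath, rng):
    """Run one metapath-guided walk from `start`; return (path, weight) or None if stuck."""
    path, weight = [start], 1.0
    for wanted_type in metapath[1:]:
        # Only neighbours of the object type required by the metapath are eligible.
        candidates = [(nxt, w) for nxt, w in adj.get(path[-1], [])
                      if node_type[nxt] == wanted_type]
        total = sum(w for _, w in candidates)
        if total <= 0:
            return None
        idx = rng.choices(range(len(candidates)),
                          weights=[w for _, w in candidates], k=1)[0]
        nxt, w = candidates[idx]
        weight *= w / total  # assumed path weight: product of transition probabilities
        path.append(nxt)
    return tuple(path), weight


def sample_path_instances(adj, node_type, start, metapath, n_walks, min_weight, seed=0):
    """Collect distinct path instances and drop those whose weight is below `min_weight`."""
    rng = random.Random(seed)
    instances = {}
    for _ in range(n_walks):
        result = sample_walk(adj, node_type, start, metapath, rng)
        if result is not None and result[1] >= min_weight:
            instances[result[0]] = result[1]
    return instances


node_type = {"a1": "A", "a2": "A", "a3": "A", "p1": "P", "p2": "P"}
adj = {  # node -> [(neighbour, edge weight)]
    "a1": [("p1", 3.0), ("p2", 1.0)],
    "p1": [("a1", 3.0), ("a2", 1.0)],
    "p2": [("a1", 1.0), ("a3", 1.0)],
}
instances = sample_path_instances(adj, node_type, "a1", ("A", "P", "A"),
                                  n_walks=200, min_weight=0.15)
for path, weight in sorted(instances.items()):
    print(path, round(weight, 4))
```

With these toy weights, the two instances that pass through `p2` each have weight 0.125 and are discarded by the threshold, which corresponds to abandoning low-weight path instances to reduce training time.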

In a weighted HIN, different relationships on a path instance contain different attributes, such as the time when the author attended the venue and the author’s position in the author list of a paper. The values of these attributes can enrich the information of nearby links and nodes and reflect the data distribution characteristics of the whole dataset. A dynamically designed model according to data characteristics can make the model more explainable and improve the accuracy of the search result. Therefore, for a given metapath \(P_i (i\in \{1,\cdots ,n\})\), we capture attributes \(Al_{i.j}=\{al_1 , al_2, \cdots , al_l\}\) contained in each path instance \(p_{i.j}(j\in \{1,\cdots ,m\})\). In Sect. 6, we will design the evaluation function and use \(Al_{i.j}\) to improve the model.

During the pretraining process, the embedding of nodes on different path instances is generated corresponding to specified metapaths. Then, for nodes’ structure information, we obtain the embedding of path instance \({\varvec{p}}_{i.j}^s (i\in \{1,\cdots ,n \},j\in \{1,\cdots ,m\})\) through convolutional layers. After that, the model obtains the embedding of metapath \({\varvec{P}}_i^s\) through a pooling layer. For nodes’ content information, we obtain the embedding of path instance \({\varvec{p}}_{i.j}^c\) and the embedding of metapath \({\varvec{P}}_i^c\) in the same way.
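
The two steps in this paragraph (convolution over a path instance, then pooling over the instances of one metapath) can be sketched in plain Python. The details below are assumptions made for illustration and are not specified in the text: each filter spans a window of two consecutive nodes, ReLU is the activation, the maximum response over window positions is kept per filter, and the metapath embedding is the element-wise maximum over its path instances. The filter values are arbitrary.

```python
def conv_path_instance(node_vectors, filters, window=2):
    """Embed one path instance: one value per filter (ReLU, then max over window positions)."""
    embedding = []
    for filt in filters:  # each filter is a flat list of window * d weights
        responses = []
        for start in range(len(node_vectors) - window + 1):
            # Flatten the embeddings of `window` consecutive nodes into one patch.
            patch = [x for vec in node_vectors[start:start + window] for x in vec]
            responses.append(max(0.0, sum(w * x for w, x in zip(filt, patch))))
        embedding.append(max(responses))
    return embedding


def pool_metapath(instance_embeddings):
    """Embed a metapath as the element-wise maximum over its path instance embeddings."""
    return [max(column) for column in zip(*instance_embeddings)]


# Two path instances of the same metapath, each with three nodes of dimension 2.
instance_1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
instance_2 = [[1.0, 0.0], [2.0, 0.0], [1.0, 1.0]]
filters = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]

p_1 = conv_path_instance(instance_1, filters)
p_2 = conv_path_instance(instance_2, filters)
metapath_embedding = pool_metapath([p_1, p_2])
print(p_1, p_2, metapath_embedding)
```

The same two functions would be applied once with structure embeddings and once with content embeddings of the nodes, giving the structure and content versions of the metapath embedding.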

4.2.3 Path representation

To obtain the comprehensive representation of the path containing all metapaths \(P_i\) while distinguishing each target node, we apply an attention mechanism to fuse different metapaths. Considering that the path attention weight is affected by the target node, we use an attention scoring function with Bahdanau’s additive style to calculate the importance of each metapath \(P_i\) to node \(a_m (m\in \{1,\cdots ,\vert A\vert \})\). The importance score \(score(a_m,P_i )\) is calculated as follows:

$$\begin{aligned} score(a_m,P_i)={\varvec{v}}_\alpha ^T\cdot \tanh (W_1\cdot {\varvec{a}}_m+W_2\cdot {\varvec{P}}_i) \end{aligned}$$

where \(W_1\) and \(W_2\) are trainable weight matrices, \(\varvec{v}_\alpha ^T\) is a trainable vector, and \({\varvec{P}}_i\in \{{\varvec{P}}_i^s,{\varvec{P}}_i^c \}\) is the embedding of metapath \(P_i\). After obtaining the importance score \(score(a_m,P_i)\), the attention weights of \(P_i\) can be calculated through the softmax function:

$$\begin{aligned} \alpha _{a_{m,n}-P_i}= \frac{\exp {score(a_m,P_i)}}{\sum _{i'=1}^{k}\exp {score(a_m,P_{i'})}} \end{aligned}$$

which can be interpreted as the contribution of the metapath \(P_i\) for node \(a_m\). \(a_{m,n}-P_i\) indicates that node \(a_n\) has a path to \(a_m\) under metapath \(P_i\). Based on the attention weights \(\alpha _{a_{m,n}-P_i}\) in Function 2, path representation can be computed as the weighted average of the metapath representation \({\varvec{P}}_i\). The content representation \({\varvec{C}}_{a_{m,n}}\) and structure representation \({\varvec{S}}_{a_{m,n}}\) of the path are as follows:

$$\begin{aligned}&{\varvec{S}}_{a_{m,n}} = \sum _{i}\alpha _{a_{m,n}-P_i}\cdot {\varvec{P}}^s_i \end{aligned}$$
$$\begin{aligned}&{\varvec{C}}_{a_{m,n}} =\sum _{i}\alpha _{a_{m,n}-P_i}\cdot {\varvec{P}}^c_i \oplus an_1 \oplus an_2 \oplus \cdots \oplus an_l \end{aligned}$$

where \(an_1, an_2, \cdots , an_l\) are the attributes of nodes on all path instances in \(P_i\).
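
The attention-based fusion of metapaths can be illustrated with a small numerical sketch. All numbers, dimensions and helper names (`matvec`, `additive_score`, `softmax`, `fuse_metapaths`) are hypothetical; in the model, \(W_1\), \(W_2\) and \({\varvec{v}}_\alpha\) are learned, whereas here they are fixed by hand.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def additive_score(v, W1, a, W2, P):
    """score(a, P) = v^T tanh(W1 a + W2 P)."""
    hidden = [math.tanh(s + t) for s, t in zip(matvec(W1, a), matvec(W2, P))]
    return sum(vi * hi for vi, hi in zip(v, hidden))


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_metapaths(a, metapath_embs, v, W1, W2):
    """Return the attention weights and the weighted combination of metapath embeddings."""
    alphas = softmax([additive_score(v, W1, a, W2, P) for P in metapath_embs])
    dim = len(metapath_embs[0])
    fused = [sum(alpha * P[k] for alpha, P in zip(alphas, metapath_embs))
             for k in range(dim)]
    return alphas, fused


a_m = [0.5, -0.2]                                      # embedding of the target node
metapath_embs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # P_1, P_2, P_3
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.5, 0.0], [0.0, 0.5]]
v = [1.0, -1.0]

alphas, structure_rep = fuse_metapaths(a_m, metapath_embs, v, W1, W2)
# For the content channel, the node attributes are concatenated after the weighted sum.
node_attributes = [2020.0, 1.0]
content_rep = structure_rep + node_attributes  # list concatenation plays the role of the ⊕ operator
print([round(x, 3) for x in alphas], len(content_rep))
```

For brevity the sketch reuses one set of metapath embeddings for both channels; in the model the structure channel uses \({\varvec{P}}_i^s\) and the content channel uses \({\varvec{P}}_i^c\). Because the score depends on the target node embedding, two different target nodes generally receive different weights over the same metapaths.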

5 Double channel-based search model and algorithm

The model is designed for similarity search tasks on a weighted HIN. To train the content information and structure information of each node, we design a double channel convolutional neural network, and the output is the content representation and structure representation of paths’ semantics between two nodes. In addition, to capture sufficient text description information, we add the attributes captured in the pretraining part to the content channel. We use another attention mechanism on the above two channels’ outputs to generate a comprehensive representation of the semantics between two nodes. Finally, we design the similarity search algorithm.

In this section, we propose the model for top-k similarity search based on a double channel convolutional neural network on weighted HINs (abbreviated as SSDCC+) and give the specific top-k similarity search algorithm. The model is divided into three parts: node representation, path representation and the combination of content and structure. We will elaborate the details in the following sections.

5.1 Double channel CNN-based path representation

The key challenge in SSDCC+ is how to comprehensively consider the content information and the structure information of the object. The existing methods [49, 50] train the nodes’ structure information after content information, which can be considered to be a representation just for structure information. To comprehensively represent the content information and the structure information, we construct a double channel convolutional neural network-based method. The content representation and structure representation are trained by different channels called Cont-channel and Stru-channel, respectively, based on the double channel mechanism.

In Cont-channel, nodes on the path instance \(p_{i.j}\) are represented by \({\varvec{a}}^c\) described in Sect. 4.1. After path representation in Sect. 4.2, we acquire the content embedding \({\varvec{C}}_{a_{m,n}}\) for given metapaths, which consider both text description and attribute information. Analogously, in Stru-channel, nodes on the path instance \(p_{i.j}\) are represented by \({\varvec{a}}^s\), and nodes on different metapaths \(P_i\) have their corresponding embedding \({\varvec{a}}_i^s\). Finally, we acquire the structure embedding \({\varvec{S}}_{a_{m,n}}\) from this channel.

5.2 Combination of content and structure based on attention

From the previous steps, we obtain the content embedding \({\varvec{C}}_{a_{m,n}}\), the structure embedding \({\varvec{S}}_{a_{m,n}}\) and the integrated embedding \({\varvec{a}}^{s+c}\). In this section, we combine them by another attention mechanism, as shown in Fig. 4.

Fig. 4 The attention mechanism for combining two channels

The attention score over different representations is:

$$\begin{aligned}&score({\varvec{a}}^{s+c},{\varvec{C}}_{a_{m,n}}) = \varvec{v}_\beta ^T \cdot \tanh (W_3 \cdot {\varvec{a}}^{s+c} + W_4 \cdot {\varvec{C}}_{a_{m,n}}) \end{aligned}$$
$$\begin{aligned}&score({\varvec{a}}^{s+c},{\varvec{S}}_{a_{m,n}}) = \varvec{v}_\beta ^T \cdot \tanh (W_3 \cdot {\varvec{a}}^{s+c}+W_4 \cdot {\varvec{S}}_{a_{m,n}}) \end{aligned}$$

where \(W_3\), \(W_4\) are trainable weight matrices and \(\varvec{v}_\beta ^T\) is a trainable vector. Based on these two scores, the attention weights of content representation and structure representation are computed as follows:

$$\begin{aligned} \alpha _C= & {} \frac{\exp {(score({\varvec{a}}^{s+c},\varvec{C}_{a_{m,n}}))}}{\exp {(score({\varvec{a}}^{s+c},\varvec{C}_{a_{m,n}}))}+\exp {(score({\varvec{a}}^{s+c},\varvec{S}_{a_{m,n}}))}} \end{aligned}$$
$$\begin{aligned} \alpha _S= & {} \frac{\exp {(score({\varvec{a}}^{s+c},\varvec{S}_{a_{m,n}}))}}{\exp {(score({\varvec{a}}^{s+c},\varvec{C}_{a_{m,n}}))}+\exp {(score({\varvec{a}}^{s+c},\varvec{S}_{a_{m,n}}))}} \end{aligned}$$

Then, we acquire the combination of content and structure with the function shown as follows:

$$\begin{aligned} {{\varvec{C}}}{{\varvec{S}}}=\alpha _C \cdot \varvec{C}_{a_{m,n}}\oplus \alpha _S \cdot {\varvec{S}}_{a_{m,n}} \end{aligned}$$

In this function, we concatenate the results of the two channels after multiplying them by their attention weights. Compared with methods that directly sum the two representations, as other attention mechanisms do, concatenation helps reduce information loss.
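
A minimal numerical sketch of this fusion step is given below. The helper names and all values are hypothetical, and the content and structure representations are assumed to have the same dimension so that the shared matrix \(W_4\) applies to both.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def score(v, W3, a_sc, W4, rep):
    """score(a^{s+c}, rep) = v^T tanh(W3 a^{s+c} + W4 rep)."""
    hidden = [math.tanh(s + t) for s, t in zip(matvec(W3, a_sc), matvec(W4, rep))]
    return sum(vi * hi for vi, hi in zip(v, hidden))


def fuse_channels(a_sc, C, S, v, W3, W4):
    """Weight each channel by its attention weight, then concatenate the two results."""
    s_c = score(v, W3, a_sc, W4, C)
    s_s = score(v, W3, a_sc, W4, S)
    m = max(s_c, s_s)  # subtract the maximum for numerical stability
    e_c, e_s = math.exp(s_c - m), math.exp(s_s - m)
    alpha_c, alpha_s = e_c / (e_c + e_s), e_s / (e_c + e_s)
    fused = [alpha_c * x for x in C] + [alpha_s * x for x in S]
    return alpha_c, alpha_s, fused


a_sc = [0.2, 0.4]   # integrated embedding of the node
C = [1.0, -0.5]     # output of the content channel
S = [0.3, 0.8]      # output of the structure channel
W3 = [[1.0, 0.0], [0.0, 1.0]]
W4 = [[0.5, 0.5], [-0.5, 0.5]]
v = [1.0, 1.0]

alpha_c, alpha_s, CS = fuse_channels(a_sc, C, S, v, W3, W4)
print(round(alpha_c, 3), round(alpha_s, 3), len(CS))
```

Because the weighted vectors are concatenated, the fused representation has twice the dimension of each channel output, and the contribution of each channel stays separately visible to the layers that follow.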

5.3 Objective function and model training

We have obtained the comprehensive representation of content and structure information between target node \({\varvec{a}}_m\) and its similar node \({\varvec{a}}_n\). Furthermore, to carry out the top-k similarity search task, we need a similarity value between the two nodes. Therefore, we feed the representation \({{\varvec{C}}}{{\varvec{S}}}\) into the MLP component to calculate the final output, that is:

$$\begin{aligned} y_{m,n}=sigmoid[f({{\varvec{C}}}{{\varvec{S}}})] \end{aligned}$$

where f is the MLP component that has two dense layers with the ReLU activation function. Its output will be fed into a sigmoid layer to obtain the result \(y_{m,n}\).

We use a logarithmic loss function in the model. The target value \(y_{m,n}\) of a node pair (m, n) equals 1 if the two nodes are similar and 0 if they are dissimilar. The objective function is shown as follows:

$$\begin{aligned} o=-\frac{1}{\vert N\vert }\sum _{i=1}^{\vert N\vert }(y_i\log {p_i}+(1-y_i)\cdot \log {(1-p_i)}) \end{aligned}$$

where \(y_i\) is the target value of the i-th sample, N is the input sample set, and \(p_i\) is the predicted probability that the i-th input instance is similar.

Intuitively, a target node should have more path instances between it and a similar node under a given metapath, and the corresponding output \(y_{m,n}\) should be larger than that of other nodes. Therefore, we use negative sampling to train the model, where the negative dataset includes node pairs without a path instance to \(a_n\) under the given metapath. We set the score of these dissimilar node pairs to zero, and then the objective function can be formulated as follows:

$$\begin{aligned} o=-\frac{1}{\vert N^+\vert }\sum _{i \in N^+}y_i\log {p_i}- \frac{1}{\vert N^-\vert }\sum _{j \in N^-}(1-y_j)\log {(1-p_j)} \end{aligned}$$

where \(N^+\) is the positive sample set and \(N^-\) is the negative sample set.
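
Since the target value is 1 for every positive sample and 0 for every negative sample, the objective above reduces to the mean of \(-\log p_i\) over \(N^+\) plus the mean of \(-\log (1-p_j)\) over \(N^-\). The sketch below computes this reduced form; the function name and the clipping constant `eps` are assumptions added for numerical safety and are not part of the original formulation.

```python
import math


def negative_sampling_loss(pos_probs, neg_probs, eps=1e-12):
    """Log loss averaged separately over positive and negative samples.

    pos_probs: predicted similarity probabilities for pairs in N+ (target 1)
    neg_probs: predicted similarity probabilities for pairs in N- (target 0)
    """
    pos_term = -sum(math.log(max(p, eps)) for p in pos_probs) / len(pos_probs)
    neg_term = -sum(math.log(max(1.0 - p, eps)) for p in neg_probs) / len(neg_probs)
    return pos_term + neg_term


confident = negative_sampling_loss([0.9, 0.8], [0.1, 0.2])
uncertain = negative_sampling_loss([0.6, 0.5], [0.4, 0.5])
print(round(confident, 4), round(uncertain, 4))
```

Averaging the two terms separately keeps the loss balanced when the number of sampled negatives differs from the number of positives.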

5.4 Top-k search algorithm

In this section, we introduce the top-k search algorithm (Algorithm 1), which obtains the k nodes most similar to the target node. Specifically, we first input the HIN H and the metapath set \(\{ P_1, P_2,...,P_x\}\) to the SSDCC+ model. The model is trained until the loss function stabilizes or the number of training iterations reaches the threshold we set in advance. For the target node \(n_q\) \((n_q\in V)\), we can obtain a similarity score to every node in H. Then, the other nodes are sorted according to the score \(y_{n_q,n_i}\) (\(i=1,2,...,\vert V\vert\), \(n_i\in V\), representing each node in graph H). Finally, we take the nodes ranked in the top k positions as the output of the search algorithm.

figure a

6 Dual channel combination under the influence of attributes

The method proposed in Sect. 5.2 uses an attention mechanism to fuse the dual channel’s output, which is dynamic and personalized, but the explainability is insufficient. According to the distribution characteristics of data in the network, it is more reasonable to analyse and use attribute information of data for dual channel fusion. In this section, we will design a function to evaluate the importance of Cont-channel and Stru-channel and propose a dual channel fusion model under the influence of attributes to improve the accuracy of search results and make the model more explainable.

The attribute information is extracted during the path sampling process, as described in Sect. 4.2.2. To evaluate the importance of the two channels, we first need to determine which attributes affect it and how these attributes affect it.

6.1 The weight of time difference

First, in weighted HINs, we consider the time difference \(\bigtriangleup t \in Al_i\) in metapath \(P_i\) to be a key factor, where \(Al_i \subseteq Al\) is the set of attributes on the metapath \(P_i\). Intuitively, for a path instance \(p_{i.j}\) with two time attributes, the greater the difference between them, the greater the uncertainty of the information the instance contains. Therefore, the importance of structural information between two nodes decreases, while the importance of content information increases. Under the influence of the time difference, the pivotal weight of the structure information of \(p_{i.j}\) is as follows:

$$\begin{aligned} \beta _{i.j}^t= \frac{\beta _{min}^t-\beta _{max}^t}{t_{max}-t_{min}} \times \bigtriangleup t + \beta _{max}^t \end{aligned}$$

where \(\bigtriangleup t\) is the time difference and \(t_{min}\) is the earliest time in the dataset, i.e. the minimum value of the time attribute, \(t_{max}\) is the latest time, \(\beta ^t_{min}\) is the minimum value of \(\beta ^t\) and \(\beta ^t_{max}\) is the maximum value of \(\beta ^t\).
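The formula is a linear interpolation: a time difference of zero gives \(\beta ^t_{max}\), and a time difference of \(t_{max}-t_{min}\) gives \(\beta ^t_{min}\). A small sketch follows; the numeric bounds used in the example are illustrative assumptions only.

```python
def beta_time(delta_t, t_min, t_max, beta_min, beta_max):
    """Structure weight of one path instance given its time difference."""
    slope = (beta_min - beta_max) / (t_max - t_min)  # negative slope
    return slope * delta_t + beta_max

# Example with assumed bounds 0.2 and 0.8 on a 2010-2014 time span.
same_year = beta_time(0, 2010, 2014, 0.2, 0.8)
four_years = beta_time(4, 2010, 2014, 0.2, 0.8)
```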

Then, the comprehensive structure pivotal weight \(\beta ^t\) between two nodes is calculated based on the abundant path instance information contained in multiple metapaths:

$$\begin{aligned} \beta ^t= f(\beta _{1.1}^t, \dots , \beta _{i.j}^t, \dots , \beta _{n.m}^t) \end{aligned}$$

where n is the number of metapaths between two nodes, m is the number of path instances under the last metapath \(\{P_{n}\}\), and f is a statistical function suitable for the data distribution of the current weighted HIN. In this model, we simply design f as a function to calculate the average value of the input data. In addition, we should not ignore structural information because the time difference is too large or ignore content information because it is too small. Therefore, there is a restriction on the value range: \(0<\beta ^t_{min}\le \beta ^t \le \beta ^t_{max}<1\).
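With f chosen as the average, the aggregation can be sketched as follows. The nested-list input, one inner list of \(\beta ^t_{i.j}\) values per metapath, is an assumed layout for illustration.

```python
def beta_time_overall(weights_per_metapath):
    """Average the per-instance weights over all metapaths between two nodes."""
    flat = [w for weights in weights_per_metapath for w in weights]
    if not flat:
        raise ValueError("no path instances between the two nodes")
    return sum(flat) / len(flat)

# Two path instances under the first metapath, one under the second.
overall = beta_time_overall([[0.8, 0.6], [0.4]])
```

Because every input already lies in \([\beta ^t_{min}, \beta ^t_{max}]\), their average lies in the same range, so the restriction on \(\beta ^t\) holds without extra clipping.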

6.2 The weight of path instances’ number

We believe that another key factor affecting the importance of Cont-channel and Stru-channel is the number of path instances under a given metapath. Although it is not an attribute corresponding to a type of link, it reflects the characteristics of the data distribution in the network. The more path instances there are, the more active the node is in the network, and the more important the structure information is; conversely, the fewer path instances there are, the more important the content information is. Under its influence, the pivotal weight of structural information is as follows:

$$\begin{aligned} \beta ^{num}= \frac{\beta _{max}^{num}}{num_{max}} \times num \end{aligned}$$

where num is the number of path instances between two nodes, \(\beta _{max}^{num}\) is the maximum value of \(\beta ^{num}\), and \(num_{max}\) is the maximum number of path instances we set according to the distribution of data. Theoretically, if the number of path instances is zero, it is not suitable to consider structure information because the path instance part of the model input is a zero vector, and it is meaningless to calculate the similarity with \({\varvec{0}}\). Consequently, there is a restriction on the value range: \(0 \le \beta ^{num} \le \beta ^{num}_{max}<1\).
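A sketch of this weight is given below. Capping `num` at `num_max` is our assumption, made so that the stated upper bound \(\beta ^{num} \le \beta ^{num}_{max}\) also holds for pairs with more path instances than the chosen maximum; the example values are illustrative.

```python
def beta_num(num, num_max, beta_max):
    """Structure weight derived from the number of path instances."""
    capped = min(num, num_max)  # assumption: larger counts saturate
    return beta_max / num_max * capped

# Example with an assumed num_max of 20 and beta_max of 0.8.
no_paths = beta_num(0, 20, 0.8)     # no structure evidence
some_paths = beta_num(10, 20, 0.8)
many_paths = beta_num(50, 20, 0.8)  # saturates at beta_max
```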

6.3 Importance evaluation function

In this research, the influence of \(\beta ^t\) and \(\beta ^{num}\) on channel importance is considered equal. The structure and content pivotal weights under the influence of attributes are obtained as follows:

$$\begin{aligned} \beta _C= & {} \frac{\beta ^t+\beta ^{num}}{2} \end{aligned}$$
$$\begin{aligned} \beta _S= & {} 1 - \beta _C \end{aligned}$$

Combined with the results of Sect. 5.2, the output of the two channels is fused by considering both the attention mechanism and attribute importance. The specific function is changed from Function 9 to:

$$\begin{aligned} {{\varvec{C}}}{{\varvec{S}}}_{atten+attr}=\alpha _C \cdot {\varvec{C}}_{a_{m,n}}\oplus \beta _C \cdot \varvec{C}_{a_{m,n}}\oplus \alpha _S \cdot {\varvec{S}}_{a_{m,n}} \oplus \beta _S \cdot {\varvec{S}}_{a_{m,n}} \end{aligned}$$

In this function, we multiply the attribute weights \(\beta _C\) and \(\beta _S\) with the content representation and structure representation, respectively, and add them to the result of Function 9. This makes the model more explainable and more dynamic.
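The fused representation can be illustrated as follows. The sketch assumes that \(\oplus\) denotes vector concatenation and that the attention weights \(\alpha _C\), \(\alpha _S\) and the attribute weight \(\beta _C\) have already been computed; plain Python lists stand in for the channel outputs.

```python
def scale(vec, weight):
    """Multiply every component of a vector by a scalar weight."""
    return [weight * x for x in vec]

def fuse_channels(content, structure, alpha_c, alpha_s, beta_c):
    """Combine attention-weighted and attribute-weighted channel outputs."""
    beta_s = 1.0 - beta_c
    return (scale(content, alpha_c) + scale(content, beta_c)
            + scale(structure, alpha_s) + scale(structure, beta_s))

fused = fuse_channels([1.0, 2.0], [4.0], alpha_c=0.5, alpha_s=0.5, beta_c=0.25)
```

Under the concatenation assumption, the fused vector is twice as long as the concatenation of the two channel outputs, so the first dense layer of f would need a matching input size.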

According to the above analysis, we update the similarity score (Function 10) as:

$$\begin{aligned} y_{m,n}=sigmoid[f({{\varvec{C}}}{{\varvec{S}}}_{atten+attr})] \end{aligned}$$

In summary, the overall framework of SSDCC+ is shown in Fig. 5. The model is divided into five parts: Node Representation, Node Attribute Capture, Path Representation, Channel Importance Evaluation, and Combination of Content and Structure. Specifically, Sect. 4 carries out the node representation and attribute extraction, takes the result as the input of path representation, and obtains the single channel output containing semantic information. Section 5 proposes the dual channel combination function, which uses the attention mechanism to fuse the content and structure information. Then, in Sect. 6, the importance evaluation function is proposed to improve the model, and dynamic and reasonable similarity scores of nodes are obtained.

Fig. 5 The overall framework of SSDCC+

7 Experimental evaluation

In this section, we describe the datasets used in our experiments, evaluate the accuracy of SSDCC+ and compare it with its variants and baseline methods.

7.1 Environment and data collection

We use the Academic Social (AS) Network from AMiner [51, 52], which is publicly available, for the experiments. This network includes 2.1M papers, 8M citations, 1.7M authors and 4.3M coauthors. We construct a weighted heterogeneous graph based on three types of nodes: authors, papers, and venues. We also extract the attributes of different types of nodes, different relation edges among three sets of nodes (authors writing papers, papers published on venues, authors collaborating on a paper, and papers cited by other papers), and the attributes corresponding to specific edges to construct the network. We prune all papers that lack information on affiliations, year and publication venue.

The original AS network is too large to use directly, so we extracted two subsets, AS-5 and AS-small, for the experiments. AS-5 contains papers published from 2010 to 2014, their related authors, and the conferences where the papers were published. However, it is a sparse dataset containing a large number of unpopular authors and conferences. To alleviate the high computational cost and to observe the experimental results on popular nodes, we further filter the data to construct another dataset, AS-small, using extraction methods similar to those in other studies [12, 53]. AS-small is a smaller dataset that keeps only conferences with at least 900 papers published between 2010 and 2014 and papers with at least one author contributing five papers. Compared with AS-5, the nodes in AS-small are more popular, which makes it easier to evaluate the similarity of nodes by manual annotation. The statistics can be found in Table 1. In particular, the number of Paper nodes in the table is the same as the number of relationships. This is mainly because each paper can only be published once in one venue.

Table 1 Dataset statistics of AS-5 and AS-small

7.2 Experimental setup — metapath and model parameters

7.2.1 Metapath selection

In academic social networks, we take authors as target nodes. Although similarity measurement methods such as PathSim [12] cannot synthesize multiple metapaths at the same time to measure node similarity, the metapath used in it is reasonable. It presents two metapaths, APA and APVPA, which represent the scenarios of coauthoring papers and publishing papers in the same venue, respectively. In addition, metapath APPA indicates that one author cited another author’s paper, which is also very important.

Since the AS-small dataset contains only 8 conferences, each of which has published a large number of papers, any two authors in AS-small have a probability of at least 1/8 of being connected through APVPA, which makes the training results poor. Therefore, metapaths AA, APPA, and APVPA are used to train and test the SSDCC+ model for similarity measurement of author-type objects on AS-5, and metapaths APA and APPA are used on AS-small.

7.2.2 Parameter settings

We select the text descriptions of the nodes as the content information and train Doc2vec on them to obtain content embeddings. Then, metapath2vec is used to embed node structure information under different metapaths. We set the embedding dimension of both content and structure to 128, that is, \(d_1=128\) and \(d_2=128\). In addition, all parameters are initialized using a uniform distribution. Here, we set the number of training epochs to 5 and the number of negative samples to 5. The minibatch Adam optimizer is selected to optimize these parameters, where the learning rate is set to 0.001 and the batch size is set to 512. Moreover, to make the method scalable, we use multiple processes to train the model. Specifically, 6 processes are used to train in parallel on the experimental machine, which increases the training speed by a factor of 6.

7.2.3 Capture of attributes and weights

In Sect. 4.1, the model captures attributes of nodes and adds them to the Cont-channel to enrich the content information. For authors, the attributes include the count of published papers, the total number of citations, the H-index of the author, etc.; for papers, the attributes include affiliations, year, etc.

In Sect. 4.2.2, the model uses a weight-based random walk (RW) to obtain the path instances and calculate their weights. The weight between two adjacent nodes on the path instance is calculated based on the value of the corresponding attributes of the edges between them. For example, the value of the attributes between two papers is the time difference of their publication (the larger the time difference is, the smaller the weight is), the value of the attribute between an author and a paper is the author’s position, and so on. In the random walk process, the probability of walking from the current node to any of its neighbours is equal, but the weight between the passed node pairs will be recorded. Then, the weight of a path instance is the product of all the weights on it. After that, we only keep a few path instances with the greatest weights. The statistics of path sampling results on datasets AS-5 and AS-small are shown in Table 2.
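The sampling procedure described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the adjacency dictionary with precomputed edge weights is a hypothetical input, and the node-type constraints imposed by a metapath are omitted for brevity.

```python
import heapq
import math
import random

def sample_path_instances(graph, start, length, num_walks, keep, rng):
    """Uniform random walks that record edge weights and keep the heaviest paths.

    graph[u] maps each neighbour v of u to the weight of edge (u, v).
    """
    paths = {}
    for _ in range(num_walks):
        path, weights, node = [start], [], start
        for _ in range(length):
            neighbours = graph.get(node)
            if not neighbours:
                break
            nxt = rng.choice(sorted(neighbours))  # equal probability per neighbour
            weights.append(neighbours[nxt])
            path.append(nxt)
            node = nxt
        if len(path) == length + 1:
            # The weight of a path instance is the product of its edge weights.
            paths[tuple(path)] = math.prod(weights)
    return heapq.nlargest(keep, paths.items(), key=lambda kv: kv[1])
```

Storing paths in a dictionary removes duplicate walks before the heaviest `keep` instances are selected.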

Table 2 Number of path instances on AS-5 and AS-small

7.2.4 Comparison experiment

Then, we compare the search algorithm based on SSDCC+ against its baselines and variants:

  • SSDCC [54]: Compared with SSDCC+, this model ignores the attribute information of objects; that is, only the text description of objects is used in the Cont-channel. It does not consider the influence of link attributes and only uses an attention mechanism to combine the two channels. SSDCC is our previously published model.

  • SSDCC-: The attention mechanism between the content and structure channels is not considered, and they are directly connected.

  • SS+C: A single channel model is used that only considers Cont-channel.

  • SS+S: A single channel model is used that only considers the Stru-channel.

  • Metapath2vec.apvpa [17]: We use the metapath2vec algorithm here and sample the path according to the metapath APVPA to obtain the embedding of the node. According to the Euclidean distance between the vectors of different authors, a similarity search is carried out.

  • Metapath2vec.apa [17]: We use the metapath APA for path sampling to obtain the embedding of nodes. Similarity search is carried out by Euclidean distance.

  • Doc2vec [24]: In this method, nodes are embedded according to the content information of the object and then the top-k similar objects are searched for each target node by using node vectors.

In practice, we make use of Keras for all compared algorithms, and ReLU is used by all neural nets in the hidden layers.
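For the embedding-based baselines above, the search step reduces to a nearest-neighbour lookup. A minimal sketch with Euclidean distance follows; the dictionary of toy embeddings is an assumption for illustration and would be replaced by the learned node vectors.

```python
import math

def euclidean_top_k(query, embeddings, k):
    """Return the k nodes whose embeddings are closest to that of `query`."""
    q = embeddings[query]
    dists = [(math.dist(q, vec), node)
             for node, vec in embeddings.items() if node != query]
    return [node for _, node in sorted(dists)[:k]]

toy = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [3.0, 4.0], "d": [0.0, 0.5]}
nearest = euclidean_top_k("a", toy, 2)
```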

7.3 Experimental results and comparison

In AS-5, we analyse the data from 2010 to 2014, select 10 popular authors (e.g. Jack J. Dongarra, P S Yu, Elisa Bertino, and Jiawei Han) and label similar authors for each of them according to their research direction, paper organization and publication to test the accuracy of the similarity search results. In AS-small, we randomly select 10 authors who have published at least one paper and process the data in the same way as in AS-5. We label each result object with a relevance score of 0 (not relevant) or 1 (relevant).

7.3.1 Search result analysis

The top-10 similarity search results on AS-small are shown in Table 3, and the results on AS-5 are shown in Table 4. The tables show the search results of four test subjects randomly selected on two datasets. The search results on AS-small are taken as examples.

W Pedrycz is a professor and Canada Research Chair in the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada. His main research directions involve computational intelligence, fuzzy modelling and granular computing, knowledge discovery and data mining. The similar authors returned for him, including Duoqian Miao, Young Chel Kwun, Jin Han Park, etc., are all well-known scholars in related research fields. From 2010 to 2014, they all participated in the same conference as W Pedrycz, copublished papers with him, or cited (or were cited by) his papers. As seen from the search results in the table, our experimental results are reasonable.

Table 3 Top-10 similarity search results of four popular authors on AS-small
Table 4 Top-10 similarity search results of four popular authors on AS-5

7.3.2 Overall comparison

We evaluated the precision and normalized discounted cumulative gain (NDCG) of the top-10 similarity search results and compared them with the baselines and the variants of our model, as shown in Table 5. (In Table 5, M. apvpa stands for Metapath2vec.apvpa, M. apa stands for Metapath2vec.apa and M. appa stands for Metapath2vec.appa.) It can be observed that the performance of SSDCC+ is better than that of SSDCC, which means that considering nodes’ attributes in content representation and paths’ attributes in dual channel fusion can effectively improve the accuracy of search results. The performance of SSDCC is better than that of SSDCC-, which indicates that using an attention mechanism between the Cont-channel and Stru-channel also helps to improve the accuracy. The results of SSDCC+, SSDCC and SSDCC- are also better than those of SS+S and SS+C, showing that the model’s performance can be improved by using the double channel convolutional neural network, which comprehensively considers the content information and structure information. As SSDCC+ considers multiple metapaths while Metapath2vec.apvpa, Metapath2vec.appa and Metapath2vec.apa are each trained on only one metapath, the performance of SSDCC+ and its variants is better than that of the baselines. In addition, Doc2vec has the worst performance. The first reason is that the academic social network dataset is rich in structural information and relatively insufficient in content information. The second reason is that the Doc2vec model does not consider the attribute information of objects but only learns the representation of text descriptions.
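For reference, the two metrics can be computed from the binary relevance labels of a ranked result list as sketched below. The exact NDCG variant used in the experiments is not specified; this sketch assumes the common \(\log _2\) discount and takes the ideal ordering from the labelled results themselves.

```python
import math

def precision_at_k(rels, k):
    """Fraction of the top-k results labelled relevant (1)."""
    return sum(rels[:k]) / k

def dcg(rels):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(rels, k):
    """DCG of the top-k results divided by the DCG of the ideal ordering."""
    ideal = sorted(rels, reverse=True)[:k]
    best = dcg(ideal)
    return dcg(rels[:k]) / best if best > 0 else 0.0

labels = [1, 0, 1, 0]  # made-up relevance labels of a ranked list
p4 = precision_at_k(labels, 4)
n4 = ndcg_at_k(labels, 4)
```

Precision ignores the position of relevant results within the top k, whereas NDCG rewards placing them earlier, which is why the two metrics can rank models differently.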

Table 5 Comparison on precision and NDCG of top-10 similarity search results

7.3.3 Comparison of NDCG

To evaluate the performance of the models under different k values, we plot the NDCG value with respect to k in Fig. 6a and b. The SSDCC+ model always achieves the best performance in all similarity search tasks. In most cases, the NDCG of SSDCC and SSDCC-, which consider both content information and structure information, is higher than that of the other methods, which consider only a single kind of node information. Specifically, on AS-5, when k is equal to 7, the NDCG value of SSDCC+ is 0.5905, which is greater than the 0.4385 of SSDCC, 0.4167 of SSDCC-, 0.3129 of SS+S, 0.25 of SS+C, 0.1667 of Metapath2vec.apa, 0.0594 of Metapath2vec.apvpa, and 0.0556 of Doc2vec. This shows that SSDCC+ performs best among the compared models.

In addition, on AS-small and AS-5, the performance of Doc2vec is very poor, mainly because the nodes’ structural information in academic social networks is dense, while content information is scarce. Because only a single metapath is considered, the search performance of metapath2vec.appa and metapath2vec.apa on AS-small and that of metapath2vec.apvpa and metapath2vec.apa on AS-5 are relatively low.

7.3.4 Comparison of precision

The precision of SSDCC+ and the compared models under different k values is shown in Fig. 6c and d. Compared with NDCG, the precision of the similarity search results is relatively low. In all cases, SSDCC+ clearly performs best on both datasets. Overall, the precision of the search results on AS-small is better than that on AS-5. This is mainly because AS-small is smaller, which makes the candidates for manual labelling clearer and the model training results more accurate.

Moreover, on both AS-small and AS-5, when k is less than 6, the precision of each model gradually drops to 0. The reason is that many methods cannot find any similar nodes.

Fig. 6 Comparison with existing approaches for accuracy on k

7.4 Discussion

In this section, we analyse and discuss the experimental results in more detail. First, according to the results in Table 5 and Fig. 6, models that only consider one metapath, such as Metapath2vec.apvpa, perform worse than models that consider multiple metapaths. Therefore, it is necessary to capture the semantic information of multiple metapaths at the same time. Second, many objects have text description information. For some types of objects, such as papers, the text description reflects most of their characteristics, while for many others, such as authors, it contains insufficient characteristics. The performance of Doc2vec in Fig. 6 is very poor on both datasets, confirming this phenomenon. Therefore, similarity search using text information alone is not applicable to some types of objects. Third, when the dataset is large, the accuracy of the model is poor, and applying the model to a small dataset obtains better results. Finally, as Fig. 6 shows, when \(k \ge 6\), the performance on both datasets becomes acceptable. Therefore, it is better to provide users with at least 6 search results rather than fewer.

8 Conclusion

In this paper, we provided a focused study on finding similar nodes in HINs, while related works include similarity search in homogeneous information networks and other downstream tasks, including search and recommendation in HINs. We analysed the problems of similarity search in HINs. To overcome these challenges, including adjusting different metapaths for different queries, flexibly integrating structure information and content information to comprehensively represent nodes, and considering the time factor and other data distribution information, we proposed a model for top-k similarity search based on a double channel convolutional neural network on weighted HINs (SSDCC+), which, for the first time, uses a CNN with two channels to train the content information and structure information. We generate different structure and content embeddings for each node according to different metapaths, which are not available in previous studies, and extract the internal attribute information of the node. SSDCC+ flexibly considers multiple metapaths between two nodes and uses an attention mechanism to obtain the structure and content representation of semantics between nodes. Then, we propose another attention mechanism, which considers the output results of the two channels and the attributes of the dataset to combine the node’s content and structure information for its comprehensive representation. These two attention mechanisms personalize the similarity search. Finally, the importance evaluation function we designed improves the accuracy of similarity search results and makes the model more explainable.

The experimental results on a public dataset show that SSDCC+ is superior to the compared models in the accuracy of similarity search, mainly because it considers multiple metapaths as well as content and structure information, and has the advantages of personalization and explainability. These results demonstrate the effectiveness of the proposed model, which achieves higher performance than the existing methods we compared against. In future work, we plan to automatically identify different metapaths, consider the dynamic characteristics of the network, and add more attribute relationships.