1 Introduction

Clustering is an unsupervised learning task whose aim is to partition the data into a number of homogeneous clusters. Traditional clustering methods typically provide a single clustering, and fail to reveal the diverse patterns underlying the data. In fact, because of the multiplicity of real-world data, several different clusterings may co-exist in a given dataset, and each may provide a meaningful grouping of the data (Bailey 2013; Niu et al. 2013; Wang et al. 2018). For instance, a collection of films can be grouped by genre, producer, or director; the authors in an academic network can be clustered by their research fields or by their organizations. These alternative clusterings are all meaningful. To mine the underlying structure of the data from different perspectives and present alternative clusterings of different aspects, the study of multiple clusterings has emerged during the last decade (Bailey 2013). Multiple clustering approaches focus not only on the quality of the clusterings, but also on their diversity. However, it is a known dilemma to balance quality against diversity (Bailey 2013; Wang et al. 2020).

Fig. 1 An example of the academic heterogeneous network. The authors can be clustered according to alternative meaningful patterns (i.e., research areas and organizations)

The early approaches to multiple clusterings mainly focus on single-view data; they generate alternative clusterings in non-redundant (independent) subspaces (Cui et al. 2007; Niu et al. 2013; Mautz et al. 2018; Wang et al. 2019b; Miklautz et al. 2020), by meta clustering of multiple base clusterings (Caruana et al. 2006), by reducing the redundancy with the already generated clusterings (Bae and Bailey 2006; Yang and Zhang 2017), or by simultaneously reducing the redundancy between all the to-be-generated clusterings (Wang et al. 2018; Yao et al. 2019b). Recently, some works have extended multiple clusterings to multi-view data, which are naturally represented with heterogeneous feature views; for example, a film can be encoded by its audio, video, and snapshots. Multi-view multiple clusterings (Yao et al. 2019a) first adapts self-representation learning (Luo et al. 2018) to extract the individuality and commonality information matrices of multi-view data, and then applies semi-nonnegative matrix factorization (Ding et al. 2010) to each combination of an individuality matrix (for diversity) and the shared commonality matrix (for quality) to generate alternative clusterings. Deep matrix factorization based multi-view multiple clusterings (Wei et al. 2020b) factorizes the multi-view data matrices into multiple common subspaces layer-by-layer, and generates an alternative clustering per layer. Deep incomplete multi-view multiple clusterings (Wei et al. 2020a) seeks multiple clusterings while completing the missing data with multiple decoding networks. All these single-/multi-view multiple clusterings algorithms are designed for applications with i.i.d. data samples, and cannot handle data samples whose dependencies are expressed as links in networks.

In the real world, many complex systems take the form of networks (Cui et al. 2019; Zhang et al. 2020; Yang et al. 2020), where the samples are nodes that depend on each other to some degree, as reflected by the seen/unseen links between them. There are homogeneous networks with only one type of nodes and relationships. Yet a large number of networks are heterogeneous in nature, involving diverse types of nodes and/or relationships between nodes, such as social networks, biological networks, and academic networks. Traditional approaches perform clustering on networks by spectral methods (Mall et al. 2013; Li et al. 2019), by ranking learning (Sun et al. 2009; Chen et al. 2015), by matrix factorization (Lin et al. 2016), by hierarchical approaches (Pio et al. 2018), or by representation learning (Perozzi et al. 2014; Grover and Leskovec 2016; Dong et al. 2017). These methods only produce a single clustering assignment for a network, while in reality a node in a network usually admits multiple clustering assignments. As illustrated in Fig. 1, nodes (e.g., authors) in an academic publication network can be grouped by their research fields and by their organizations. Ensemble clustering also generates diverse base clusterings, but only to reach a consolidated clustering; it still outputs a single clustering, and its base clusterings are often highly redundant.

Several attempts have been made to generate multiple vector representations for each network node. Splitter (Epasto and Perozzi 2019) splits the original network into multiple ego-networks and then learns a new representation from each ego-network. MNE (Yang et al. 2018) factorizes the network proximity matrix into several groups of embedding matrices to generate different representations. These two methods are designed for homogeneous networks. ASPEM (Shi et al. 2018a) decomposes a HIN into multiple aspects and then learns representations of the HIN. HEER (Shi et al. 2018b) embeds a HIN via edge representations. But HEER defines the aspects based on edges with predefined ground-truth labels, which are not always available in practice. Although these multi-facet network embedding methods can produce multiple representations, from which multiple clustering results can consequently be generated, they are not designed for multiple clusterings. Furthermore, they suffer from optimization inconsistency and from the difficulty of controlling the redundancy between alternative clusterings.

We propose an approach called NetMCs to explore multiple clusterings in HINs. NetMCs first adopts a set of meta-path schemes with different semantics on a HIN and considers each meta-path scheme as a base clustering aspect. Then, guided by the set of meta-path schemes, NetMCs introduces a variation of the skip-gram framework that can jointly optimize the multiple clustering aspects, and obtains the respective embedding representations and individual clusterings therein. In addition, NetMCs explicitly controls the embedding diversity of the same node across different clustering aspects, and thus enhances the diversity between alternative clusterings. The main contributions of our work are summarized as follows:

(i) To the best of our knowledge, NetMCs is the first effort to generate multiple clusterings with quality and diversity from a heterogeneous information network, which is an important and practical topic, but quite challenging and mostly overlooked by previous solutions.

(ii) NetMCs introduces a variation of the skip-gram model to jointly optimize different clustering aspects, learn multiple diverse embeddings, and generate multiple clusterings therein. The corresponding optimization procedure for this variation is also presented. As a result, NetMCs addresses the optimization inconsistency between representation learning and clustering, and can generate multiple clusterings of high quality.

(iii) NetMCs introduces redundancy terms that simultaneously minimize the overlap between embeddings and place nodes with similar embeddings into the same cluster, thereby generating diverse alternative clusterings.

(iv) Experiments on real-world datasets and visualization examples demonstrate the effectiveness of NetMCs at mining multiple clusterings on heterogeneous networks.

2 Related works

Our work has close connections with three lines of related work, namely multiple clusterings, network clustering, and multi-facet network embedding learning.

Multiple clusterings focuses on how to generate different clusterings with both high quality and diversity from the same dataset (Bailey 2013). It is less well studied than single-/multi-view clustering and ensemble clustering (Jain 2010; Zhou 2012), let alone network clustering, because of its demand for generating multiple groups of clusters and the difficulty of properly trading off quality against diversity at the same time. Based on hierarchical clustering, Bae and Bailey (2006) presented a multiple clusterings solution (COALA). The main idea of COALA is that instances with higher intra-class similarity in the first clustering should still gather in the same cluster, while those with lower intra-class similarity are candidates to be placed into different clusters in the next clustering. Jain et al. (2008) proposed Dec-kmeans to find multiple sets of mutually orthogonal cluster centroids and then generate diverse clusterings based on these centroids. Different from COALA and Dec-kmeans, which directly control the diversity between clustering assignments, other solutions control the diversity between clustering subspaces and then generate different clusterings in these subspaces. Cui et al. (2007) projected the data matrix into orthogonal subspaces to obtain different feature representations and then found alternative clusterings in these subspaces. Mautz et al. (2018) also attempted to explore multiple mutually orthogonal subspaces, along with the optimization of the classical k-means objective function, to find non-redundant clusterings. However, the orthogonality constraint is too strict to generate more than two clusterings. Wang et al. (2019b) generated multiple independent subspaces with semantic interpretation via independent subspace analysis and the minimum description length principle, and then performed kernel matrix factorization-based clustering in these subspaces to explore diverse clusterings. Miklautz et al. (2020) combined the benefits of a deep neural network-based nonlinear feature transformation with a non-redundant clustering objective to obtain alternative clusterings. Yang and Zhang (2017) explicitly introduced a regularization term to quantify and minimize the redundancy between the already generated clusterings and the to-be-generated one, and plugged this regularization into the next round of semi-nonnegative matrix factorization based clustering (Ding et al. 2010) to find another clustering. Wang et al. (2018) and Yao et al. (2019b) directly reduced the redundancy between all the to-be-generated clusterings to find all clusterings simultaneously. Besides, Caruana et al. (2006) first generated a number of high-quality clusterings, then grouped these clusterings at the meta-level, and thus allowed the user to select the desired non-redundant clusterings for their application. These multiple clusterings methods are designed only for single-view data.

Given the multiplicity of multi-view data, it is desirable but more difficult to generate multiple clusterings from the same multi-view data. Three approaches have been proposed to attack this challenging task. MVMC (Yao et al. 2019a) first explores multiple clusterings on multi-view data by mining the individuality information encoding matrices and the commonality information matrix shared across views via self-representation learning (Luo et al. 2018). It employs each individuality similarity matrix together with the commonality similarity matrix to generate a distinct clustering via matrix factorization-based clustering. However, given the cubic time complexity of self-representation learning, MVMC is hardly applicable to datasets with a large number of samples. To alleviate this drawback, DMClusts (Wei et al. 2020b) extends deep matrix factorization (Trigeorgis et al. 2016) to collaboratively factorize the multi-view data matrices into multiple representational subspaces layer-by-layer, and seeks a different clustering of high quality per layer. In addition, it introduces a new balanced redundancy quantification term to enhance the diversity among these clusterings, thus reducing the overlap between the produced clusterings. DiMVMC (Wei et al. 2020a) considers incomplete multi-view data; it achieves the completion of the data views and of multiple shared representations simultaneously by optimizing multiple groups of decoder networks. It further minimizes a redundancy term to simultaneously control the diversity among these representations and among the parameters of the different networks, so as to generate individual clusterings.

All these single-/multi-view multiple clusterings algorithms are vector-based methods, which assume that data samples can be directly represented by independent feature vectors. As such, they cannot be applied directly to the abundant network data with inter-dependence. With the growth of network data, many network clustering works have been proposed. To name a few, Li et al. (2019) formulated the similarity matrix construction as an optimization problem and applied spectral clustering to network data. Lin et al. (2016) transformed different types of relations into a group of matrices and combined them with a greedy search approach. Chen et al. (2015) proposed a probabilistic generative model that simultaneously achieves clustering and ranking on a heterogeneous network with an arbitrary network schema. Zhou et al. (2019) proposed a recurrent meta-structure based framework to measure the similarity between nodes by integrating all the meta-paths and meta-structures, and applied it to clustering and ranking tasks on HINs. HENPC (Pio et al. 2018) extracts possibly overlapping and hierarchically-organized heterogeneous clusters and exploits them for predictive purposes; it can take into account the autocorrelation of a HIN at different levels of granularity. Random walk based methods (Perozzi et al. 2014; Grover and Leskovec 2016; Dong et al. 2017) first obtain a collection of random walk sequences and then feed them into the skip-gram framework to learn node embeddings, on which traditional clustering algorithms are applied. These methods only generate a single embedding or clustering for one network and lack consideration of redundancy control.

Some multi-facet network embedding methods have been introduced, most of which focus on generating multiple vector representations for each network node. Based on a principled decomposition of the ego-network, Splitter (Epasto and Perozzi 2019) splits each node into multiple representations by performing local graph clustering, where each representation encodes the role of the node in a different local community in which it participates. MNE (Yang et al. 2018) factorizes the network proximity matrix into several groups of embedding matrices, and adds a diversity constraint to force different matrices to focus on different aspects of the nodes. ASPEM (Shi et al. 2018a) decomposes a HIN into multiple aspects before learning embeddings, to obtain quality representations of the HIN. HEER (Shi et al. 2018b) embeds a HIN via edge representations that are further coupled with properly-learned heterogeneous metrics to capture the incompatible semantics of the HIN.

Although these network embedding methods can generate different representation vectors, they cannot produce clustering results directly and thus suffer from the optimization inconsistency between representation learning and clustering. Furthermore, they neglect redundancy control among clusterings. To address these issues, our NetMCs generates diverse HIN node embeddings from different meta-path based semantics, and derives alternative clusterings from these embeddings in a coherent way.

3 Our method

3.1 Preliminaries

For problem formulation, we first give related concepts and notations.

Definition 1

(HIN) A Heterogeneous Information Network is defined as a graph \({\mathcal{G}}=({\mathcal{V}}, {\mathcal{E}}, {\mathcal{T}})\), in which each node v and each edge e are associated with their mapping functions \(\phi (v): {\mathcal{V}}\rightarrow {\mathcal{T}}_V\) and \(\psi (e): {\mathcal{E}}\rightarrow {\mathcal{T}}_E\), respectively. \({\mathcal{T}}_V\) and \({\mathcal{T}}_E\) denote the sets of node and edge types, where \(|{\mathcal{T}}_V|+|{\mathcal{T}}_E|>2\).

Definition 2

(Meta-path) A meta-path scheme \({\mathcal{P}}\) is defined as a path of the form \(V_{1}{\mathop {\longrightarrow }\limits ^{R_1}}V_{2}{\mathop {\longrightarrow }\limits ^{R_2}} \cdots V_{t}{\mathop {\longrightarrow }\limits ^{R_t}}V_{t+1} \cdots {\mathop {\longrightarrow }\limits ^{R_{l-1}}}V_{l}\), wherein \(R_i \in {\mathcal{T}}_E\) represents an edge type and \(R=R_{1}\circ R_{2}\circ \cdots \circ R_{l-1}\) defines the composite relation between node types \(V_1\) and \(V_l\).

Definition 3

(Multiple clusterings on a HIN) Given a HIN \({\mathcal{G}}=({\mathcal{V}}, {\mathcal{E}}, {\mathcal{T}})\), a set of meta-paths \(\mathcal{PS}=\{{\mathcal{P}}_{m}\}_{m=1}^{M}\), and a set of the number of clusters \({\mathcal{K}}=\{k_{m}\}_{m=1}^{M}\), the problem of multiple clusterings on \({\mathcal{G}}\) is to partition nodes into M diverse clustering patterns. The nodes in the m-th clustering are partitioned into \(k_m\) disjoint clusters \({\mathcal{C}}_{m}=\{{\mathcal{C}}_{m,1}, \ldots , {\mathcal{C}}_{m,k_{m}}\}\).
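To make these notions concrete, the following minimal Python sketch encodes a toy HIN in the spirit of Fig. 1 together with a set of meta-path schemes; all node names and data structures here are illustrative assumptions, not the ones used in our implementation.

    # Typed nodes: phi(v) maps each node to its type in T_V (Definition 1).
    node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P", "c1": "C", "o1": "O"}

    # Typed edges: psi(e) is implied here by the (source, target) node types.
    edges = [("a1", "p1"), ("a2", "p1"), ("a2", "p2"),   # author writes paper
             ("p1", "c1"), ("p2", "c1"),                 # paper appears at conference
             ("a1", "o1"), ("a2", "o1")]                 # author affiliated with organization

    # Adjacency lists grouped by neighbor type, convenient for meta-path walks.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, {}).setdefault(node_type[v], []).append(v)
        adj.setdefault(v, {}).setdefault(node_type[u], []).append(u)

    # A meta-path scheme is a sequence of node types (Definition 2); each
    # scheme below induces one clustering aspect (Definition 3), so M = 2.
    meta_paths = [["A", "P", "C", "P", "A"],   # co-conference semantics
                  ["A", "O", "A"]]             # colleague semantics
    M = len(meta_paths)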

3.2 The proposed methodology

Random walk is a powerful approach for capturing information in networks, and it is extensively used for network embedding (Perozzi et al. 2014; Grover and Leskovec 2016; Dong et al. 2017). Considering its efficiency and effectiveness in handling large-scale networks, we adopt a random walk approach as the basis of NetMCs. We thus briefly review random walk based network embedding first.

3.2.1 Generating node embeddings

Given a text corpus, Mikolov et al. (2013) proposed word2vec to learn distributed representations of the words in the corpus. Inspired by it, network representation learning methods such as DeepWalk (Perozzi et al. 2014) and node2vec (Grover and Leskovec 2016) view nodes in a graph as words in a corpus. Specifically, they both perform random walks on a graph and obtain a set of truncated random walk sequences \({\mathcal{W}}\). Then, the skip-gram model is used to learn the representation of each node. The objective is to maximize the likelihood of the context given the target node:

$$\begin{aligned} \mathop {max}\prod _{u\in {\mathcal{V}}}\prod _{v\in {\mathcal{N}}(u)}p(v|u) \end{aligned}$$
(1)

where \({\mathcal{N}}(u)\) denotes the neighbors of target node u and p(v|u) is commonly defined as a softmax function as:

$$\begin{aligned} p(v|u)=\frac{\exp (\langle {\mathbf{h}}_{u}, {\mathbf{h}}_{v}\rangle )}{\sum _{v^{\prime}\in {\mathcal{V}}}\exp (\langle {\mathbf{h}}_{u}, {\mathbf{h}}_{v^{\prime}}\rangle )} \end{aligned}$$
(2)

where \({\mathbf{h}}_{u}\) is the embedding vector of node u, \({\mathbf{h}}_{v}\) is the context embedding vector of node v, and \(\langle {\mathbf{h}}_u, {\mathbf{h}}_v\rangle\) denotes their inner product. By maximizing Eq. (1), nodes that frequently appear together within a context window are trained to have similar embeddings. Equation (1) is, however, limited to homogeneous networks; in other words, it cannot leverage the information carried by the multiple node types of a HIN.
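As a toy illustration (with made-up tensor sizes, not our actual model), the snippet below evaluates the softmax of Eq. (2) over a small set of node embeddings in PyTorch.

    import torch

    num_nodes, d = 6, 8
    H = torch.randn(num_nodes, d)      # target embeddings h_u
    H_ctx = torch.randn(num_nodes, d)  # context embeddings h_v

    def p_v_given_u(u: int, v: int) -> torch.Tensor:
        scores = H_ctx @ H[u]                    # <h_u, h_v'> for every candidate v'
        return torch.softmax(scores, dim=0)[v]   # Eq. (2)

    # Maximizing Eq. (1) pushes p(v|u) up for every v in N(u), so nodes that
    # co-occur within a context window end up with similar embeddings.
    print(float(p_v_given_u(0, 1)))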

Meta-paths can capture the structural and semantic correlations between different types of nodes in a HIN (Sun and Han 2013). A meta-path is defined by a sequence of relations in the network, and can be described by a sequence of object types when there is no ambiguity. For example, A-O-A in Fig. 1 is a meta-path denoting the colleague relation between authors, and A-P-C is a meta-path denoting the publication relationship between authors and conferences. metapath2vec (Dong et al. 2017) introduces a heterogeneous skip-gram model that performs meta-path based random walks on heterogeneous networks to learn node representations. Formally, given a heterogeneous network \({\mathcal{G}}=({\mathcal{V}}, {\mathcal{E}}, {\mathcal{T}})\) with \(|{\mathcal{T}}_V|>1\), metapath2vec learns the representation of node u by maximizing the probability of its heterogeneous context \({\mathcal{N}}_{t}(u)\) (\(t\in {\mathcal{T}}_V\)) generated by meta-path based random walks:

$$\begin{aligned} \mathop {max}\sum _{u\in {\mathcal{V}}}\sum _{t\in {\mathcal{T}}_V}\sum _{v_{t}\in {\mathcal{N}}_{t}(u)}\log p(v_{t}|u) \end{aligned}$$
(3)

where \({\mathcal{N}}_{t}(u)\) denotes u’s neighbors of the t-th node type.

For the work in this paper, we follow the concept of meta-path to describe, at the meta level, the possible relations between different types of objects that can be derived from a heterogeneous network. Most meta-path based methods use only one meta-path for a specific task. However, the semantics captured by a single meta-path are too limited to mine the abundant information in a HIN. To address this problem, Zhao et al. (2017) introduced the concept of meta-graph, an extension of meta-paths. Compared with the sequential structure of a meta-path, a meta-graph does not restrict the intermediate linked structure between the source and target nodes. Wang et al. (2019a) merged multiple meta-paths to learn node similarity by weighting the meta-paths. Nevertheless, while both meta-graphs and merged meta-paths strive for richer semantic information, they may encounter the semantic incompatibility problem, due to the heterogeneity of HINs (Shi et al. 2018b). Semantic incompatibility refers to the semantic inconsistency of meta-paths with respect to the same node. If the semantics of these meta-paths are inconsistent, then projecting the related nodes into a uniform embedding space will degrade the embedding vectors. For example, suppose Bob likes both music and movies directed by Nolan. If these nodes were embedded into one metric space, Bob would be close to neither music nor Nolan, due to the dissimilarity between music and Nolan, which results in information loss. Therefore, we adopt multiple different meta-paths to perform random walks on the HIN and capture diverse semantic information. Specifically, guided by a set of diverse meta-paths \(\mathcal{PS}\), we perform random walks on the HIN. Each meta-path \({\mathcal{P}}_m\in \mathcal{PS}\) generates truncated random walk sequences \({\mathcal{W}}_{{\mathcal{P}}_m}\), which are subsequently used as the input of the variant skip-gram framework. Thus, NetMCs generates an embedding for the nodes under each selected meta-path as follows:

$$\begin{aligned} \max \sum _{{\mathcal{P}}_m\in \mathcal{PS}}\sum _{u\in {\mathcal{W}}_{{\mathcal{P}}_m}}\sum _{v\in {\mathcal{N}}_{m}(u)}\log \frac{\exp (\langle {\mathbf{h}}_{u}^{m},{\mathbf{h}}_{v}^{m}\rangle )}{\sum _{v^{\prime}\in {\mathcal{V}}}\exp (\langle {\mathbf{h}}_{u}^{m},{\mathbf{h}}_{v^{\prime}}^{m}\rangle )} \end{aligned}$$
(4)

where \({\mathbf{h}}_{u}^m\in {\mathbb {R}}^{d}\) is the embedding vector of node u under the meta-path \({\mathcal{P}}_m\), and \({\mathcal{N}}_{m}(u)\) denotes the nodes within the context window of u. By optimizing the above equation, we obtain individual embeddings of each node under the different semantics conveyed by the diverse meta-paths.
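The following sketch illustrates such meta-path guided walks on a toy graph; the graph, the scheme-cycling rule, and the parameter values are simplifying assumptions for exposition rather than our exact implementation.

    import random

    node_type = {"a1": "A", "a2": "A", "a3": "A", "p1": "P", "p2": "P", "c1": "C"}
    adj = {"a1": ["p1"], "a2": ["p1", "p2"], "a3": ["p2"],
           "p1": ["a1", "a2", "c1"], "p2": ["a2", "a3", "c1"],
           "c1": ["p1", "p2"]}

    def meta_path_walk(start, scheme, walk_length):
        # `scheme` is symmetric (head == tail), e.g. ["A","P","C","P","A"],
        # so the walker can cycle through it with modular arithmetic.
        walk = [start]
        while len(walk) < walk_length:
            next_type = scheme[len(walk) % (len(scheme) - 1)]
            candidates = [x for x in adj[walk[-1]] if node_type[x] == next_type]
            if not candidates:
                break                  # dead end: truncate the walk
            walk.append(random.choice(candidates))
        return walk

    # W_{P_m}: start only from nodes whose type matches the scheme's head.
    scheme = ["A", "P", "C", "P", "A"]
    walks = [meta_path_walk(u, scheme, walk_length=8)
             for u in node_type if node_type[u] == scheme[0]
             for _ in range(3)]        # here w = 3 walks per node
    print(walks[0])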

3.2.2 Enhancing diversity and quality

However, the semantics of different meta-paths may overlap. For instance, A-O-A and A-P-A may share semantics to some extent, because authors in the same institution are more likely to co-author papers, which may lead to similar clustering results. We expect the embeddings induced by different meta-paths to be different. Thus, we introduce a regularization term that quantifies and minimizes the overlap between the embedding vectors of the same node through the similarity of their probability distributions. In addition, NetMCs should make each embedding space carry distinct semantic information as much as possible, while still capturing information complementary to the other embedding spaces. In this way, each clustering differs from the others because the redundancy between the embedding spaces induced by the individual meta-paths is reduced. For the M different embeddings of node u, the regularization term is defined as:

$$\begin{aligned} min \sum _{m^{\prime}=1,m^{\prime}\ne m}^{M}\log p({\mathbf{h}}_u^{m^{\prime}}|{\mathbf{h}}_u^{m}) \end{aligned}$$
(5)

We can now jointly optimize Eqs. (4) and (5) to obtain multiple embeddings of the same node in the HIN and then perform clustering in the respective embedding spaces. However, as we pointed out, this sequential embedding-then-clustering paradigm may lead to optimization inconsistency, due to the distinct goals of embedding and clustering (this is confirmed in our experiments). Thus, we seek multiple embeddings and the multiple clusterings therein simultaneously, by employing a transformation \(f({\mathbf{h}}_u^{m})\) that generates a node’s clustering assignment from its embedding. To keep the model from becoming too complicated, we simply adopt \({\mathbf{z}}_u^{m}=softmax({\mathbf{Q}}_{m}{\mathbf{h}}_u^{m})\) in this paper, where \({\mathbf{z}}_{u}^m\in {\mathbb {R}}^{k_m}\) is the soft assignment vector of node u in the m-th clustering, and \({\mathbf{Q}}_m \in {\mathbb {R}}^{k_m\times d}\) is a transformation matrix for the m-th clustering aspect. Because each meta-path generates an embedding space and reflects a semantic pattern of the HIN, we generate a clustering result in each embedding space. Thus, the number of clusterings M equals the number of meta-paths.

In addition, we assume that the assignment vectors of similar embeddings should be similar. Cross entropy is a common evaluation measure in multi-class classification tasks; it quantifies the difference between two probability distributions. Therefore, we naturally adopt the cross entropy metric to measure the similarity between different soft assignment vectors. Given a node u and its neighbor v in the m-th clustering aspect, the cross entropy to be minimized is:

$$\begin{aligned} \min \quad CE({\mathbf{z}}_u^{m},{\mathbf{z}}_v^{m})=-\sum _{i=1}^{k_m}({\mathbf{z}}_u^{m})_i\log ({\mathbf{z}}_v^{m})_i \end{aligned}$$
(6)

where \(({\mathbf{z}}_u^{m})_i\) is the i-th element of \({\mathbf{z}}_{u}^m\), and \(v\in {\mathcal{N}}_{m}(u)\). By minimizing the cross entropy between highly similar embedding vectors, NetMCs constrains the corresponding nodes to have similar clustering assignments, and thus improves the clustering quality.
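The sketch below shows the clustering head \({\mathbf{z}}_u^{m}=softmax({\mathbf{Q}}_{m}{\mathbf{h}}_u^{m})\) and the cross entropy of Eq. (6) in PyTorch; the sizes and random tensors are illustrative only.

    import torch

    d, k_m = 8, 3
    Q_m = torch.randn(k_m, d, requires_grad=True)  # transformation for aspect m
    h_u = torch.randn(d)                           # embedding of node u
    h_v = torch.randn(d)                           # embedding of a neighbor v

    z_u = torch.softmax(Q_m @ h_u, dim=0)          # soft assignment of u
    z_v = torch.softmax(Q_m @ h_v, dim=0)          # soft assignment of v

    # Eq. (6): minimizing CE(z_u, z_v) pulls neighboring nodes toward the
    # same cluster in aspect m.
    ce = -(z_u * torch.log(z_v)).sum()
    print(float(ce))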

3.2.3 Unified objective

By integrating the above objectives, we can define the comprehensive loss function of NetMCs. The optimization of Eq. (4) is computationally expensive, as it requires a summation over the entire set of vertices in \(\{{\mathcal{W}}_{{\mathcal{P}}_m}\}_{m=1}^M\) when computing the softmax function. In addition, the skip-gram model has a very large number of parameters, so training such a large neural network with gradient descent is time-consuming, and a lot of training data would be needed to adjust these weights without over-fitting. To address this problem, we adopt the negative sampling strategy with linear-time computation (Mikolov et al. 2013), which samples multiple negative nodes from a noise distribution for each target node; the sampling distribution is proportional to the 3/4 power of the node degree. Therefore, in each iteration, we only need to consider the target node, its neighbor nodes, and its negative samples. The comprehensive loss function of NetMCs is then defined as follows:

$$\begin{aligned} {\mathcal{J}}&=\sum _{m=1}^{M}\sum _{u\in {\mathcal{W}}_{{\mathcal{P}}_m}}{\mathcal{J}}_u^{m}({\mathbf{h}}_u^{m};{\mathbf{z}}_u^{m}), \quad \text{where}\\ {\mathcal{J}}_u^{m}({\mathbf{h}}_u^{m};{\mathbf{z}}_u^{m})&=\log \delta (({\mathbf{h}}_u^{m})^{T}{\mathbf{h}}_v^{m})+\sum _{j=1}^{b}{\mathbb {E}}_{v^{\prime}_j}[\log \delta (-({\mathbf{h}}_u^{m})^{T}{\mathbf{h}}_{v^{\prime}_j}^{m})]\\&\quad +\lambda \sum _{m^{\prime}=1,m^{\prime}\ne m}^{M}\log \delta (-({\mathbf{h}}_u^{m})^{T}{\mathbf{h}}_u^{m^{\prime}})-CE({\mathbf{z}}_u^{m},{\mathbf{z}}_v^{m}) \end{aligned}$$
(7)

where \(\lambda\) is a trade-off parameter that controls the extent of diversity between the embeddings of the same node in different embedding spaces, \(\delta\) is the sigmoid function, and b is the number of negative samples. The first term forces the embedding representations of u and its context v to be similar, while the second term forces \({\mathbf{h}}_u^{m}\) and the embeddings of the negative samples \({\mathbf{h}}_{v^{\prime}_j}^{m}\) in the same clustering aspect to be different. The third term seeks different embedding representations of the same node for different clustering aspects, and the last term forces nodes with similar embeddings to be placed into the same cluster via the classical cross entropy. By maximizing this objective, NetMCs can generate multiple embeddings and the multiple clusterings of nodes therein.
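To make Eq. (7) concrete, the sketch below evaluates the per-tuple objective \({\mathcal{J}}_u^{m}\) on random stand-in tensors; for brevity it shares one transformation matrix across aspects, whereas the model learns a separate \({\mathbf{Q}}_m\) per clustering aspect.

    import torch

    torch.manual_seed(0)
    M, d, k, b, lam = 3, 8, 4, 5, 1.0
    h_u = [torch.randn(d, requires_grad=True) for _ in range(M)]  # h_u^m, m=1..M
    h_v = torch.randn(d)                   # context embedding h_v^m
    h_neg = torch.randn(b, d)              # negative samples h_{v'_j}^m
    Q = torch.randn(k, d)                  # stand-in for Q_m (shared here)

    def J_u(m):
        pos = torch.log(torch.sigmoid(h_u[m] @ h_v))              # 1st term
        neg = torch.log(torch.sigmoid(-(h_neg @ h_u[m]))).sum()   # 2nd term
        div = sum(torch.log(torch.sigmoid(-(h_u[m] @ h_u[mp])))   # 3rd term
                  for mp in range(M) if mp != m)
        z_u = torch.softmax(Q @ h_u[m], dim=0)                    # 4th term,
        z_v = torch.softmax(Q @ h_v, dim=0)                       # cf. Eq. (6)
        ce = -(z_u * torch.log(z_v)).sum()
        return pos + neg + lam * div - ce

    loss = -sum(J_u(m) for m in range(M))  # maximize J  <=>  minimize -J
    loss.backward()                        # gradients reach every h_u^m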

Algorithm 1

Algorithm 1 details how our method works. NetMCs is divided into a preprocessing phase (MetaPathBasedRandomWalk) and a training phase (VariantSkipGram). First, parameters such as the scalar \(\lambda\), the set of meta-paths \(\mathcal{PS}=\{{\mathcal{P}}_{m}\}_{m=1}^{M}\), and the number of walks per node w are initialized. Next, NetMCs performs meta-path based random walks on the input HIN, guided by the predefined meta-path set \(\mathcal{PS}\), where \({\mathcal{P}}_{m}\) is a meta-path schema in \(\mathcal{PS}\). NetMCs starts the random walks from the nodes whose type matches the head of \({\mathcal{P}}_{m}\), and the walker then moves as dictated by \({\mathcal{P}}_{m}\). When this preprocessing phase is over, M sets of random walk sequences MP have been generated; they are then passed to the training phase. In the training phase, NetMCs first initializes the node embeddings \({\mathcal{H}}^m\) and clustering assignments \({\mathcal{Z}}^m\). After that, NetMCs successively samples a target node u, its neighbor nodes v, and negative nodes \(v^{\prime}\) from the random walk sequences MP as a tuple, and optimizes the variant skip-gram model with the loss defined in Eq. (7) to generate the diverse node embeddings \({\mathcal{H}}^m\) and clustering results \({\mathcal{Z}}^m\).

4 Experimental results and analysis

In this section, we evaluate the effectiveness and efficiency of our proposed NetMCs on mining multiple clusterings on real-world HIN datasets.

4.1 Experimental setup

Datasets We use three publicly available real-world HIN datasets: DBLP, IMDb, and YAGO. DBLP is a bibliographic network in the computer science domain. We use the subnetwork collected by Lu et al. (2019), which contains four types of nodes: author (A), paper (P), venue (V), and term (T). The edge types include authors writing papers, papers published in venues, and papers belonging to terms. IMDb is a HIN built by linking the movie-attribute information from IMDb with the user-reviewing-movie relationships from MovieLens-100K. There are five types of nodes in the network: user (U), movie (M), actor (A), director (D), and genre (G). The edge types include: user reviewing movie, actor featuring in movie, director directing movie, and movie being of genre. YAGO (Suchanek et al. 2007) is a knowledge graph derived from merging Wikipedia, GeoNames, and WordNet. The YAGO dataset consists of seven node types, person (P), organization (O), location (L), prize (R), work (W), position (S), and event (E), and 24 edge types. The statistics of these datasets are listed in Table 1.

Table 1 Statistics of DBLP, IMDb and YAGO networks

Baselines We compare NetMCs against four recent multiple clusterings methods and four multi-facet network representation learning methods. MNMF (Yang and Zhang 2017) is a multiple clusterings solution for vector data; it defines a regularization term to quantify and minimize the redundancy between the already generated clusterings and the to-be-generated one. Nr-kmeans (Mautz et al. 2018) tries to explore multiple mutually orthogonal subspaces from vector data, along with the optimization of the k-means objective function, to find non-redundant clusterings. MVMC (Yao et al. 2019a) mines the common and specific information of multi-view data with self-representation learning to achieve multiple clusterings. DMClusts (Wei et al. 2020b) employs deep matrix factorization with redundancy control to generate multiple subspaces layer by layer and obtains a different clustering result in each. Splitter (Epasto and Perozzi 2019) is the only compared method designed for homogeneous networks, while the others (metapath2vec (Dong et al. 2017), ASPEM (Shi et al. 2018a), HEER (Shi et al. 2018b)) target heterogeneous networks. These multi-facet network embedding methods were introduced in the Introduction.

For all the embedding methods, we set the embedding dimension d to 128. For the random walk based methods (metapath2vec, Splitter, and NetMCs), we use the following parameter values: number of walks per node w: 20; walk length l: 30; neighborhood size c: 3; and number of negative samples b: 5. The other input parameters of the compared methods are fixed (or optimized) as their authors suggested in the respective papers or shared codes. For MNMF and Nr-kmeans, we take each row vector of the adjacency matrix of the HIN as the input feature vector for these single-view methods. For MVMC and DMClusts, we adopt different node similarity matrices as data views, following PathSim (Sun et al. 2011), using different meta-paths of the HIN. For a fair comparison, the selected meta-paths are the same as in the experimental setup below. The code of NetMCs is available at our website.Footnote 1

Evaluation metrics Multiple clusterings approaches aim to generate diverse clusterings of good quality. To measure quality, we use the Silhouette Coefficient (SC) and the Dunn Index (DI), internal indices that quantify the compactness and separation of clusters. To measure redundancy, we use the Normalized Mutual Information (NMI) and the Jaccard Coefficient (JC), external indices that quantify the similarity between the clusters of two clusterings. These metrics have been extensively used in the multiple clusterings literature (Yang and Zhang 2017; Yao et al. 2019a; Wang et al. 2020). Their formal definitions are given below.

Silhouette Coefficient is the mean silhouette value over all samples. The silhouette value of a sample measures how similar the sample is to the points in its own cluster, compared with the samples in other clusters. SC is computed as follows:

$$\begin{aligned} \text{SC}({\mathcal{C}}) = \frac{1}{N}\sum ^{N}_{i=1}\frac{f(i)-a(i)}{\max {\{a(i),f(i)\}}} \end{aligned}$$
(8)

where N is the number of samples, a(i) is the average distance of the i-th sample to the other points in its own cluster, and f(i) is the minimum, over the other clusters, of the average distance of the i-th sample to the points of that cluster.

Dunn Index measures the ratio between the minimum distance between two arbitrary clusters and the maximum cluster diameter. DI is defined as follows:

$$\begin{aligned} \text{DI}({\mathcal{C}}) = \frac{\min _{i\ne j}\{\delta (c_i,c_j)\}}{\max _{1\le l\le k}\{\varDelta (c_l)\}} \end{aligned}$$
(9)

where \(\delta (c_i,c_j)\) is the distance between clusters \(c_i\) and \(c_j\), and \(\varDelta (c_l)\) is the diameter of cluster \(c_l\).

Normalized Mutual Information measures the redundancy between two clusterings \({\mathcal{C}}\) and \(\mathcal{C}^{*}\) as their mutual information normalized by their individual entropies. NMI is computed as follows:

$$\begin{aligned} \text{NMI}({\mathcal{C}},\mathcal{C}^{*}) = \frac{\sum _{i=1}^{k}\sum _{j=1}^{k^*}n_{ij}\log \frac{n\,n_{ij}}{n_{i}n_{j}}}{\sqrt{{(\sum _{i=1}^{k}n_{i}\log \frac{n_i}{n})(\sum _{j=1}^{k^*}n_{j}\log \frac{n_j}{n})}}} \end{aligned}$$
(10)

where n is the total number of samples, \(n_i\) denotes the number of samples in cluster \({\mathcal{C}}_i\), \(n_j\) the number of samples in \({\mathcal{C}}_j^*\), and \(n_{ij}\) the number of samples in the intersection of \({\mathcal{C}}_i\) and \({\mathcal{C}}_j^*\).

Jaccard Coefficient measures the overlap between two different clusterings based on ‘pair-counting’ as follows:

$$\begin{aligned} \text{JC}({\mathcal{C}},\mathcal{C}^{*}) = \frac{n_{11}}{n_{11}+n_{10}+n_{01}} \end{aligned}$$
(11)

where \(n_{11}\) is the number of sample pairs that are in the same cluster in both \({\mathcal{C}}\) and \(\mathcal{C}^{*}\); \(n_{00}\) is the number of pairs that are in different clusters in both \({\mathcal{C}}\) and \(\mathcal{C}^{*}\); and \(n_{01}\) and \(n_{10}\) are the numbers of pairs that are in the same cluster in one of \({\mathcal{C}}\) and \(\mathcal{C}^{*}\), but not in the other.

We remark that higher values of SC and DI signify a clustering of higher quality, while smaller values of NMI and JC imply that two clusterings have lower redundancy, i.e., higher diversity. We report the average SC and DI over the multiple clusterings, and the average NMI and JC over pairs of clusterings, as the evaluation results.
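As a concrete reference, the sketch below computes SC, NMI, and JC on random toy labels; DI, for which scikit-learn has no built-in, can be coded analogously from Eq. (9). The data here are stand-ins, not our experimental pipeline.

    import numpy as np
    from sklearn.metrics import silhouette_score, normalized_mutual_info_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 8))               # node embeddings
    c1 = rng.integers(0, 3, size=100)           # clustering C
    c2 = rng.integers(0, 3, size=100)           # clustering C*

    sc = silhouette_score(X, c1)                # Eq. (8)
    nmi = normalized_mutual_info_score(c1, c2)  # Eq. (10)

    def jaccard(a, b):
        # Eq. (11): count each unordered sample pair once.
        same_a = a[:, None] == a[None, :]
        same_b = b[:, None] == b[None, :]
        iu = np.triu_indices(len(a), k=1)
        n11 = np.sum(same_a[iu] & same_b[iu])
        n10 = np.sum(same_a[iu] & ~same_b[iu])
        n01 = np.sum(~same_a[iu] & same_b[iu])
        return n11 / (n11 + n10 + n01)

    print(sc, nmi, jaccard(c1, c2))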

4.2 Discovering multiple clusterings in HIN

For the first experiment, we set the number of clusterings \(M=2\). The nodes in DBLP are connected with 20 conferences, and the nodes in IMDb are connected with 23 movie genres. In addition, we extract the nodes that are related to 10 prizes in the YAGO dataset. Accordingly, we fix the number of clusters of the individual clusterings on DBLP, IMDb, and YAGO to 20, 23, and 10, respectively. This configuration of the numbers of clusters is motivated by two factors. First, these are widely used groupings for these public heterogeneous network datasets. Second, unifying the number of clusters is necessary for computing the evaluation metrics and for quantitative comparison. The parameter \(\lambda\) of NetMCs is chosen from \(10^{-3}\) to \(10^{3}\). Because the compared methods (except MNMF, Nr-kmeans, and MVMC) cannot directly give clustering results, we employ k-means to generate the individual clusterings from their respective embeddings. Since selecting meta-paths is a non-trivial task in heterogeneous network analysis and there is currently no principled way to do so (Meng et al. 2015; Zhou et al. 2019), we choose meta-paths widely used in previous works (Sun et al. 2013; Dong et al. 2017). In particular, for DBLP we select two candidate meta-paths (A-P-C-P-A and A-P-T-P-A) for all applicable methods. The candidate meta-paths for IMDb are U-M-G-M-U and U-M-U; for YAGO, they are P-L-P, P-R-P, and P-O-L-O-P. If there are many valid meta-paths, we can use only some of them, choose among them by reducing redundancy, or select them according to the clustering results in a wrapper fashion. We report the average results and standard deviations of ten independent runs of each method on generating two clusterings on the same datasets in Table 2. ‘N/A’ means no experimental result, since Nr-kmeans can only produce one clustering on the YAGO dataset. In the experiments, we focus on the clustering of authors by attended conferences or paper topics in DBLP, of users by genres or taste for movies in IMDb, and of persons by awarded prizes or locations in YAGO. We can make the following observations:

(i) Quality of multiple clusterings NetMCs often obtains a better quality than the other methods. Both MNMF and Nr-kmeans target multiple clusterings on vector data, and they suffer from the difficulty of finding subspaces of good quality from sparse network data. Although different feature views of the HIN were generated for the multi-view multiple clusterings algorithms, our NetMCs frequently outperforms MVMC and DMClusts, which are less capable of capturing the semantic information and nonlinear structure of network data. As such, traditional vector-based multiple clusterings methods achieve a lower quality than NetMCs. This fact suggests the necessity of developing network-based multiple clusterings solutions. The other four compared methods first seek multiple embeddings and then generate alternative clusterings in the embedding spaces. They also lose to NetMCs, which suggests that the joint optimization of embeddings and clusterings is necessary; in other words, the stage-wise approaches suffer from the optimization inconsistency between the sequential embedding and clustering steps. We observe that NetMCs sporadically has lower SC and DI values than some of the compared methods. This is due to the widely-recognized dilemma of obtaining alternative clusterings with both high diversity and high quality. Overall, the results prove the effectiveness of our unified framework in generating multiple clusterings of quality.

(ii) Diversity of multiple clusterings Both NMI and JC are canonically used to measure the diversity of multiple clusterings. The two clusterings generated by NetMCs often have a lower redundancy (higher diversity) than those generated by the compared methods. This is because NetMCs introduces a regularization term that specifically controls the redundancy. Although MNMF considers the redundancy, it cannot control the clusterings’ diversity well because of the sparseness and high dimensionality of the node feature vectors. Nr-kmeans does not explicitly consider the redundancy. MVMC and DMClusts also emphasize redundancy control, so their diversity is relatively better than that of the other compared methods. Splitter cannot distinguish different types of nodes to eliminate semantic redundancy. metapath2vec cannot control the semantic redundancy between different meta-paths, and it can only utilize a single meta-path at a time. ASPEM and HEER consider the inconsistency between different types of relations, but they cannot capture diverse semantic information as meta-path based methods do. For these reasons, the multiple clusterings generated by the compared methods have a lower diversity.

Besides the pairwise t-test, we further applied the nonparametric Wilcoxon signed-rank test to check the differences between NetMCs and the compared methods; all p-values are smaller than 0.01. In conclusion, NetMCs outperforms the other methods across the benchmark HIN datasets on generating multiple clusterings in terms of both quality and diversity.

Table 2 Quality and Diversity of the various compared methods on generating multiple clusterings in HIN

4.3 Ablation study

We perform an ablation study to investigate the contributing factors of NetMCs and report the results in Table 3. For this investigation, we introduce two variants of NetMCs. NetMCs-nRC denotes NetMCs without the redundancy control term (the third term in Eq. (7)). NetMCs-nCE denotes NetMCs without the cross entropy term (the last term in Eq. (7)). Because NetMCs-nCE cannot directly give clustering results, we perform k-means on the node embeddings it generates. We can see that NetMCs-nRC often performs better than NetMCs in terms of quality (SC and DI), but loses to NetMCs in terms of the diversity (NMI and JC) between clusterings. This result proves the effectiveness of the redundancy control term in improving diversity, although it comes at the expense of some quality, which coincides with the known dilemma between the quality and diversity of multiple clusterings. On the other hand, NetMCs always has a higher quality than NetMCs-nCE, while maintaining a comparable diversity. This observation confirms that optimizing the objectives of multi-facet embedding and multiple clusterings in a unified framework alleviates the optimization inconsistency problem and produces alternative clusterings of better quality.

Table 3 Comparison results of NetMCs and its variants
Fig. 2 Three alternative clusterings generated by NetMCs on the YAGO dataset. The various clusters in each clustering are distinguished by different node colors. The labels PE, PR, AD, and AS on the nodes represent different node types (person, prize, location, and organization)

We also visualize three alternative clusterings discovered by NetMCs on the YAGO dataset in Fig. 2. The considered meta-paths are P-L-P, P-R-P, and P-O-L-O-P. Different clusters in each clustering are distinguished by different node colors. From Fig. 2, we can see that the persons in the three subgraphs are clustered by the prizes they won, by their locations, and by their affiliated organizations. These three clusterings of the same set of persons carry different semantics yet are all meaningful, which signifies the capability of NetMCs to generate more than two alternative clusterings. In addition, NetMCs can cluster not only a single type of nodes of a HIN, but also the other node types included in the meta-path, such as the prizes in Fig. 2a, the locations in Fig. 2b, and the organizations in Fig. 2c. As a result, NetMCs can generate multiple clusterings for more than one node type in a HIN simultaneously. Finally, the semantics of different meta-path schemes may overlap to some extent, due to the rich semantics of a HIN. For example, in the second and third clusterings, persons in the same location are more likely to belong to the same organization. To sum up, NetMCs can generate meaningful alternative clusterings of a HIN from different perspectives.

Fig. 3 Parameter and convergence analysis of NetMCs

4.4 Parameter, convergence, and complexity analysis

Parameter analysis To study the impact of the redundancy term, we set \(M=2\) and vary \(\lambda\) from \(10^{-3}\) to \(10^{3}\), together with the special case \(\lambda =0\), and plot the variation of quality (SC, the larger the better) and diversity (NMI, the smaller the better) of NetMCs on the IMDb dataset in Fig. 3a. We observe that: (i) the quality (SC) fluctuates within a certain range at first and then decreases rapidly as \(\lambda\) further increases; (ii) the diversity (1-NMI) gradually increases and then remains relatively stable. Overall, SC and NMI tend to decline as \(\lambda\) increases, and they always stay below the starting point (\(\lambda =0\), no diversity control). This pattern is explainable: the larger \(\lambda\) is, the less similar the probability distributions of the same node in different embedding spaces are. In addition, the enhancement of diversity between multiple clusterings is often accompanied by a decrease in quality. In summary, \(\lambda\) indeed helps to boost the diversity between clusterings. The best \(\lambda\) would give the highest values of both quality and diversity; unfortunately, the best quality and diversity often cannot be attained at the same time, so the best \(\lambda\) is hard to choose. Users can adjust \(\lambda\) according to their preference for diversity (large \(\lambda\)) or quality (small \(\lambda\)).

We also study the impact of the embedding dimension d of NetMCs. From Fig. 3b, we observe that the quality (SC) fluctuates slightly, while the diversity (1-NMI) gradually decreases as the dimension increases. Overall, the embedding dimension d impacts the quality of the multiple clusterings to some extent. The best balance between quality and diversity is reached at \(d\approx 128\sim 256\).

We vary M (the number of alternative clusterings) from 2 to 6 on the IMDb dataset to explore the variation of the average quality (SC) and diversity (NMI) of the multiple clusterings generated by NetMCs. In Fig. 3c, as M increases, the average quality (SC) decreases slowly while the diversity (NMI) fluctuates within a small range. Overall, NetMCs can generate \(M\ge 2\) alternative clusterings with quality and diversity, and it obtains better performance than the compared methods across different input values of M in most cases. In practice, the number of alternative clusterings of NetMCs can be adjusted according to user preference or fixed in advance by prior knowledge.

We also investigate the effect of the walk length l on the performance of NetMCs. In Fig. 3d, we vary the walk length l from 10 to 60 and plot the corresponding SC and NMI for each fixed l. We find that either a too short or a too long walk length has a negative impact on the quality and diversity of the multiple clusterings. This is because a too short walk length hampers the capture of semantic information, while a too long walk length introduces noisy information.

Convergence analysis Figure 3e shows the loss value against the number of epochs for NetMCs and its variants. We see that the loss value drops rapidly at the beginning and typically converges within about 5 epochs. This not only proves the efficiency of NetMCs, but also shows that the redundancy control and cross entropy terms do not significantly increase the complexity of the optimization procedure.

Complexity analysis The time complexity of NetMCs includes two parts. NetMCs takes \({\mathcal{O}}(Mlw|{\mathcal{V}}|)\) steps to obtain the M groups of meta-path random walk sequences, and \({\mathcal{O}}(eMlw(c+b)d|{\mathcal{V}}|)\) steps to update the nodes’ embeddings and clustering assignment vectors, where e, l, w, c, b, and d are the number of epochs for optimization, the walk length, the number of walks per node, the neighborhood size, the number of negative samples, and the embedding dimension, respectively. Note that \(eMlwcd\ll |{\mathcal{V}}|\), and the complexity of NetMCs is linear with respect to \(|{\mathcal{V}}|\), while existing multiple clusterings methods (Yang and Zhang 2017; Mautz et al. 2018; Yao et al. 2019a; Wei et al. 2020b) typically have a quadratic or cubic complexity in \(|{\mathcal{V}}|\).

Table 4 Runtimes of compared methods (in min) on three network datasets

Table 4 gives the runtimes of the compared methods and of NetMCs. The experiments are conducted on a server.Footnote 2 All methods are implemented in Python with the PyTorch machine learning framework, except MNMF and MVMC, which run on Matlab 2014a. We observe that the three fastest methods are DMClusts, ASPEM, and HEER. DMClusts has a linear time complexity. ASPEM and HEER both build on LINE (Tang et al. 2015), which considers the second-order approximation of networks and ignores the semantic information embodied by meta-paths; as such, they run faster than the other approaches. MNMF is also relatively fast, owing to the decomposition of the sparse adjacency matrix into low-dimensional ones. On the contrary, Nr-kmeans bears the curse of dimensionality and cannot run on the large YAGO dataset. MVMC runs slowly due to the high time complexity of self-representation learning. NetMCs builds on metapath2vec, and the terms it introduces (redundancy control and clustering assignment) do not increase the order of the computational complexity, so NetMCs and metapath2vec have similar runtimes. Splitter is also a random walk based solution, but it needs to cluster each node based on its context to build the ego-networks; therefore, it has the highest runtime. Compared with the other linear methods, NetMCs carries larger coefficients in \(|{\mathcal{V}}|\), the walk length l, and the number of walks per node w. In addition, NetMCs has to generate both embeddings and clustering results simultaneously, while the other compared methods do not; thus its runtimes are higher than those of the faster competitors. In conclusion, the runtime of NetMCs is in the medium range, and it runs slightly slower than metapath2vec, since it must additionally control the redundancy and execute the clustering assignments.

5 Conclusion and future work

In this paper, we introduced the NetMCs model to explore alternative clusterings from ubiquitous heterogeneous information networks, an interesting and practical but overlooked clustering topic that conjoins multiple embeddings and multiple clusterings of the same network. NetMCs seeks multiple embeddings and multiple clustering results simultaneously through a variation of the skip-gram model guided by meta-paths with different semantics. It further introduces a redundancy term to improve the diversity between alternative clusterings. Experimental results confirm the advantage of NetMCs over state-of-the-art multiple embeddings/clusterings solutions. In the future, we will investigate principled ways to automatically choose the meta-paths (Sun et al. 2013; Zhou et al. 2019; Meng et al. 2015) and the number of alternative clusterings, as well as multiple clusterings on dynamic networks (Loglisci et al. 2012).