1 Introduction

Real-world networks are not random networks, and they usually exhibit inhomogeneity and reveal a high level of order and organisation [1]. An interesting feature that real-world networks usually present is the community structure property, under which the topology of the network is organised into modules commonly called communities or clusters [2].

In many real-world network structures such as social networks and the World Wide Web, in addition to the link information, nodes also have their attribute values referred to as attribute/content information. For example, in a social network, the nodes’ properties could describe the roles of a person while the topological structure represents relationships among a group of people.

Most of the existing approaches found in the literature make use of either link information or attribute information analysis alone for community detection. However, in real-world networks, neither piece of information on its own is sufficient in determining good clusters of the network. The link information is usually sparse and noisy. On the other hand, relying on the attribute information alone could mislead the process of community detection. For example, the process may not identify the strength of a node’s relationship with its neighbours correctly. Consequently, by taking into account only one source of information, the algorithm may fail to detect accurately the entire community memberships. Considering more than one source of information for community detection could produce meaningful clusters and improve the robustness of the network. For instance, when considering both the attribute information and connectivity information, if either one source of information is noisy or missing, the other could make up for it. Therefore, the proposed approach will consider attribute information and structure information. The structure information consists of shared neighbours information and connectivity information aspects of the network [3].

1.1 Research benefit and its impact

Community structure is a common and important topological characteristic of many real-world complex networks. Nodes belonging to a tight-knit community are more than likely to have other properties in common [4]. The determination of communities in the networks can provide powerful insights into the structure of networks, help to better understand the structural make-up of the networks and analyse complex phenomena at different scales [5, 6]. Thus, the outcome of this research work has valuable applications in several fields such as biology, social science, physics, computer science and business science [5, 7].

In social networks, for example, community detection is extremely useful in many applications, including customer segmentation, vertex labelling, recommendations and link inference [8]. Community structure is important not only in social networks, but also in various other networks. For example, determining the community structure of the Internet can address questions such as how to route data packets efficiently, how to reduce the time such traffic takes and which path to the destination is fast and safe. It can go further, by elucidating questions such as how computer viruses spread through the Internet and what mechanisms they follow to hit organisations. In dark networks, too, community structure can reveal the hidden relationships between individual terrorists [9]. Similarly, in the case of the World Wide Web (WWW), pages related to the same subject are typically organised into communities, so that identifying these communities can help to identify the category of the network as well as to understand its dynamic evolution and organisation [10].

Clustering is an important technique in mobile ad hoc and sensor networks [11] for improving certain management tasks, e.g. energy consumption and communication. Yu and Chong [12] reported that the cluster structure is an effective topology that could provide many benefits in the context of wireless sensor networks (WSNs). It could be used to increase the system capacity by spatial reuse of resources. Furthermore, it improves routing performance, because the set of cluster heads and cluster gateways can normally form a virtual backbone for inter-cluster routing, and thus, the generation and spreading of routing information can be restricted to this set of nodes. Additionally, they stated that the cluster structure makes an ad hoc network appear smaller and more stable in the view of each mobile terminal.

1.2 Related work and scope of study

1.2.1 Related work

Community detection is an active area of network science research, and over the years a wide variety of community detection algorithms have been proposed to find the communities in a network. Community detection is also referred to as graph partitioning in much of the literature [13, 14]. It is tempting to suggest that community detection and graph partitioning address the same question, since both aim to identify groups of nodes in a network that are better connected to each other than to the rest of the network. However, it is very important to stress that the two tasks can be distinguished from one another based on whether the experimenter fixes the number and size of the groups or leaves them unspecified [15]. Graph partitioning is the problem of partitioning a graph into a predefined number and size of clusters. It has been pursued particularly in computer science and related fields, with applications in parallel computing and very-large-scale integration (VLSI) design. In community detection, by contrast, which has been pursued by sociologists and more recently by physicists and applied mathematicians, with applications especially to social and biological networks, the number and size of clusters are unspecified. Furthermore, the goal in the former is usually to identify the best division of a network regardless of whether or not a good division exists. If no good division exists, the least bad one is identified as the solution. In the latter, on the other hand, the algorithm only divides the network when good divisions exist and leaves the network undivided when they do not [3, 15].

The community detection algorithms can be classified in different ways, and depending on the selected criteria, one algorithm can belong to more than one category. Among them, those based on modularity maximisation, such as the fastgreedy algorithm [16] and the Louvain algorithm [17], form the most prominent family of community detection algorithms.

The fastgreedy algorithm is an agglomerative hierarchical clustering method proposed by Newman [16]. The algorithm greedily maximises the modularity function Q and starts the process by assigning a different community to each node in the network. Then, at each stage in the process, the pair of clusters whose merger yields the greatest increase (or smallest decrease) in modularity is merged, until only one cluster remains containing all nodes in the network. The whole procedure can be represented by a dendrogram (hierarchical tree) that illustrates the order of the mergers. Cuts through the dendrogram at different levels give different partitions into communities. The optimal community clustering can be found by cutting the dendrogram at the level of maximum Q.
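The merge procedure described above can be illustrated with a short Python sketch. It is a naive version written for readability, not the implementation of [16]: it re-evaluates Q for every candidate merge instead of maintaining incremental updates, and the function names `community_modularity` and `greedy_agglomeration` are hypothetical. The graph is assumed to be unweighted, undirected and given as a list of distinct edges.

```python
from itertools import combinations


def community_modularity(communities, edges):
    """Q as a sum over communities of L_c / m - (d_c / 2m)^2 (unweighted graph)."""
    m = len(edges)
    label = {v: idx for idx, com in enumerate(communities) for v in com}
    internal = [0] * len(communities)   # edges with both ends inside community c
    degree = [0] * len(communities)     # sum of node degrees inside community c
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2
               for c in range(len(communities)))


def greedy_agglomeration(nodes, edges):
    """Merge the best pair at every step and return the partition with the highest Q."""
    current = [frozenset([v]) for v in nodes]          # one community per node
    best_q = community_modularity(current, edges)
    best_partition = list(current)
    while len(current) > 1:
        step_q, step_partition = None, None
        for a, b in combinations(current, 2):
            candidate = [c for c in current if c not in (a, b)] + [a | b]
            q = community_modularity(candidate, edges)
            if step_q is None or q > step_q:           # greatest increase or smallest decrease
                step_q, step_partition = q, candidate
        current = step_partition                       # one level of the dendrogram
        if step_q > best_q:                            # remember the cut with maximum Q
            best_q, best_partition = step_q, list(current)
    return best_partition, best_q
```

On a toy graph made of two triangles joined by a single edge, this sketch is expected to stop improving once each triangle forms its own community; the final merge into a single cluster lowers Q and is therefore not selected as the cut.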

Louvain algorithm is a hierarchical agglomerative optimisation method proposed by Blondel et al. [17] and attempts to optimise the modularity of a partition of the network. The optimisation is performed in two steps that are repeated iteratively. This algorithm starts with each node in the network belonging to its own community. Then, in the first step and for each node in the network, the algorithm uses the local moving heuristic to obtain an improved community structure by moving each node from its own community to its neighbours’ community and evaluating the gain of modularity associated with the moving of the node. The node is then placed in the community for which the modularity change is the most positive. If none of these modularity changes is positive, the node stays in its original community. This process is applied repeatedly and sequentially for each node until all the nodes in the network are considered, and no further improvement can be achieved. This concludes the first step. The second step of the algorithm consists of building a new network from the communities discovered in the first step. Therefore, the individual nodes in the new network are the individual communities from the first step. In this new network, there will be an edge between two nodes if there were edges between the corresponding two communities in the previous step. The weights of those new edges are the sum of the weights of the edges between nodes in the corresponding two communities. The edges between nodes of the same community in the first step will lead to self-loops for this community node in the new network. Once the second step is completed, it is possible to replay the first step and iterate again if necessary. The two steps repeat iteratively and stop when there is no more change in the modularity gain, and consequently, a maximum modularity is obtained.
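The second (aggregation) step described above can be sketched as follows. This is an illustrative fragment only, not the code of [17]; the function name `aggregate` and the input layout (a list of weighted edges plus a node-to-community dictionary) are assumptions made for the example, and community labels are assumed to be mutually comparable.

```python
from collections import defaultdict


def aggregate(weighted_edges, community_of):
    """Collapse each community into one node and sum the edge weights.

    weighted_edges: iterable of (u, v, weight) for an undirected graph.
    community_of:   dict mapping each node to its community label.
    Returns a dict {(c1, c2): weight}; keys with c1 == c2 are self-loops.
    """
    new_weights = defaultdict(float)
    for u, v, w in weighted_edges:
        cu, cv = community_of[u], community_of[v]
        key = (cu, cv) if cu <= cv else (cv, cu)   # undirected: canonical key order
        new_weights[key] += w
    return dict(new_weights)
```

The self-loop weight here is the plain sum of the internal edge weights; how a self-loop contributes to the degree of the new node is a convention left to whichever routine consumes the aggregated network.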

Another popular method widely used to find communities in the network is based on the random walk. An example includes Walktrap (WT) algorithm which is proposed by Pons and Latapy [18]. Walktrap algorithm is based on the principle that random walks on a network tend to get ‘trapped’ into densely connected parts defining the communities. In this method, the authors propose using a node similarity measure based on short walks to capture structural similarities between nodes instead of modularity to identify community via hierarchical agglomeration. The algorithm starts by assigning each node to its own community, and the distance for every pair of communities is computed. Communities are merged according to the minimum of their distances and the process iterated. After n − 1 steps, the algorithm finishes and gives a hierarchical structure of communities called a dendrogram. The best partition is then considered to be the one that maximises modularity.

Information theoretic algorithms are another major type of community detection clustering algorithms that use the concept of information theory to find community clusters in the network. Infomap algorithm is an example of information theoretic algorithms proposed by Rosvall and Bergstrom in [19].

Infomap algorithm characterises the problem of finding the optimal community clustering in the network as the problem of finding the most compressed (shortest) description length of the random walks on the network. It uses a random walk as a proxy for information flow in a network and minimises a map equation, which measures the description length of a random walker, over all the network clusters to reveal its community structure. To represent the community structure, the algorithm uses a two-level nomenclature based on Huffman coding: a level to distinguish communities in the network and the other to distinguish nodes in the community. In practice, the random walker is likely to stay longer inside communities, and therefore, in the process of finding a community containing few inter-community links, only the second level is needed to describe its path, leading to a compact representation.

However, most of these algorithms are classified as global algorithms, which require access to the information of the entire network, make use of topology information and largely ignore the attribute information [2].

1.2.2 Background and scope of study

Another property of similar interest is transitivity, or the global clustering coefficient, which is defined as the tendency of two nodes to be connected if they share a mutual neighbour [20]. In terms of network topology, transitivity is defined as the presence of a heightened number of sets of three vertices with edges between each pair of nodes (triangles) in the network.

Empirical studies have found that the concept of transitivity applies in about 70–80% of all cases across a variety of small group situations [21, 22]. Huijuan and Shixuan [23] proposed a graph clustering algorithm called SNGC that considers both connectivity between nodes and shared neighbours. Their experimental results show that the proposed algorithm provides promising results and could be applied to the analysis of social networks, computer networks, bioinformatics, etc.

Another common occurrence in networks is that similar nodes associate with each other more often than others (e.g. in social networks, people choose to be friends with people who share their beliefs). This property is known as homophily [24]. Traud and Kelsic [25] show that a set of nodes’ attributes can act as the primary organising principle of the communities. Several studies have been performed to investigate this phenomenon of homophily, which is summarised in McPherson et al. [24].

There have been modifications and revisions to many methods and algorithms already proposed. A comprehensive survey of community detection in graphs has been done by Fortunato in [2]. Other reviews available in the literature are by Bedi and Sharma in [26] and Plantié and Crampes in [27].

Recently, several studies [28,29,30,31,32,33,34] have shown that combining attribute and link information to detect communities in a network can improve the clustering quality. Most of these studies propose new algorithms that aim to use both sources of information; however, most methods treat all attributes in the same way without considering which ones may influence the community structure more, and lack the flexibility to balance the information coming from the network adjacency matrix (link information) against that coming from its node attributes.

Considering more than one source of information for community detection could produce meaningful clusters and improve the robustness of the network. Therefore, a pre-processing approach that considers both the attribute information and connectivity information aspects of the network for community detection is presented in this work. It should be noted that this work does not attempt to introduce a new community detection algorithm and rather proposes a pre-processing step to improve the performance of the existing community detection algorithms and enable them to execute in unreliable data network environments with better results.

In this paper, a network is represented as an undirected network G = (V, E, A), where V is the set of nodes and E is the set of edges between nodes. Each node Vi ∈ V is associated with an attribute vector \(({\text{Att}}_{i}^{1} , \ldots , {\text{Att}}_{i}^{d})\), where d is the attribute dimension and i represents the node ID.

The main goal of this work is to find k non-overlapping communities in the network, where the community structure (C) is defined as a list of non-empty node subsets \(C = \{ C_{1} , C_{2} , \ldots , C_{k} \}\) such that \(V = \cup_{i = 1}^{k} C_{i}\) and \(C_{i} \cap C_{j} = \emptyset\) for any i ≠ j.
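The three conditions in this definition (non-empty subsets, pairwise disjoint, jointly covering V) can be checked mechanically. The following is a minimal sketch; the helper name `is_valid_partition` is hypothetical and not part of the proposed approach.

```python
def is_valid_partition(nodes, communities):
    """True if the communities are non-empty, pairwise disjoint and cover all nodes."""
    seen = set()
    for community in communities:
        members = set(community)
        if not members or seen & members:   # empty subset, or overlap with an earlier one
            return False
        seen |= members
    return seen == set(nodes)               # the union must equal V
```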

1.3 Contributions arising from this work

During the past decade, the problem of community detection in networks has drawn a great deal of attention and several algorithms have been proposed. Recently, several studies have proposed methods that make use of both attribute and link information to detect communities in a network. However, as mentioned in the previous section, most of these studies propose new algorithms that aim to use both sources of information, use all attributes the same way without considering which ones may influence the community structure more, and lack the flexibility of balancing the information coming from network adjacency matrix and its node attributes. Additionally, none of the studies examines the quality and the number of community structures that could be identified in the network when some of the links are missing, i.e. noisy network environment.

The aim of this work is to design and implement a method that seeks to improve the performance of the existing community detection algorithms for incomplete networks. Hence, to the best of our knowledge, this is the first study on the community structure that seeks to:

  1. Design and implement a unique pre-processing approach for the state-of-the-art community detection algorithms by tightly integrating the attribute information, shared neighbours and connectivity information aspects of the network to produce a new matrix.

  2. Study the correlation between communities and attributes in the network and introduce a weight detection attribute model to learn the degree of contribution of different attributes based on the impact of each attribute on the community structure.

  3. Evaluate the performance of the pre-processing approach within incomplete networks.

1.4 Structure of the paper

This paper is organised as follows: the experimental datasets along with the quality metrics for assessing the network clustering results are discussed in Sect. 2. Section 3 investigates the correlations between attributes and community structure of the network. Section 4 describes the proposed method along with a similarity matrix, used to weight the links between nodes in the network. Section 5 briefly presents the experiments and evaluates the results of the proposed approach against the benchmark algorithms. The conclusion and future work are presented in Sect. 6.

2 Datasets and performance metric

2.1 Datasets

In order to investigate the correlations between attributes and community structure and to evaluate the proposed approach, anonymised Facebook datasets as introduced by Traud et al. [35] and [25] are used. The Facebook datasets are undirected and unweighted. The datasets were recorded on a particular day in September 2005 and contain Facebook networks from 100 different American university networks whose nodes represent users and whose links represent friendships between users. Attribute information about each user is also provided. Each user has seven node attributes: a student/faculty status flag, gender, major, second major/minor (if applicable), dormitory (house), year and high school. In this work, four networks from 100 Facebook datasets are used. In particular, the Caltech36, Reed98, Haverford76 and Vassar85 datasets, which contain 769, 962, 1446 and 3068 nodes and 16,656, 18,812, 59,589 and 119,161 edges, respectively, are used.

For more information about the datasets, interested readers may refer to the work by Traud et al. [25, 35]. The proposed approach in this work is not, however, limited to social networks and can be applied to many kinds of graph structures.

2.2 Performance metrics

To quantify the performance of the proposed approach, the quality of the obtained community structures is evaluated based on the modularity, number and size of detected communities.

Definition 1 modularity (Q)

Modularity (Q) is a prominent measure for the quality of a community structure introduced by Newman and Girvan in [36], and it has become a widely accepted measure of quality for community detection. Modularity states that a good cluster should have a bigger than expected number of connections between the nodes within modules and a smaller than expected number of connections between nodes in different modules. The higher the value of modularity, the stronger the community structure.

Formally, modularity can be defined as [2]:

$$Q = \frac{1}{2\left| m \right|}\mathop \sum \limits_{ij} \left[ {A_{ij} - \frac{{K_{i} K_{j} }}{2\left| m \right|}} \right]\delta_{{c_{i} c_{j} }}$$

where \(A_{ij}\) is an element of the adjacency matrix, \(K_{i}\) is the degree of node i, \(\left| m \right|\) is the total number of edges in the network, \(\delta_{{c_{i} c_{j} }}\) is the Kronecker delta symbol, which is equal to 1 if \(c_{i} = c_{j}\) and 0 otherwise, and \(c_{i}\) is the label of the community to which node i is assigned.
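As a worked illustration of this formula, the sketch below evaluates the double sum literally over all ordered node pairs. It assumes an unweighted, undirected graph with at least one edge, given as a symmetric 0/1 adjacency matrix (a list of lists); the function name `modularity` and this input layout are choices made for the example only.

```python
def modularity(adjacency, labels):
    """Evaluate Q with the double sum over all ordered node pairs (i, j)."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]      # K_i
    two_m = sum(degrees)                           # 2|m|: every edge is counted twice
    total = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:             # Kronecker delta
                total += adjacency[i][j] - degrees[i] * degrees[j] / two_m
    return total / two_m
```

For two triangles joined by one edge (seven edges in total), placing each triangle in its own community gives Q = 5/14 ≈ 0.357, whereas placing all six nodes in a single community gives Q = 0.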

3 Correlation analysis

3.1 Shared neighbours

In order to measure how likely any two nodes with a common neighbour are themselves connected, the clustering coefficient of each node in the network is calculated.

Definition 2 clustering coefficient CCO

The node clustering coefficient \({\text{CCO}}_{i}\) of a node i is defined as the ratio of the number of edges connecting the neighbours of i to the total possible number of such edges [10]:

$${\text{CCO}}_{i} = \frac{{2L_{i} }}{{K_{i} \left[ {K_{i} - 1} \right]}}$$

where \(L_{i}\) is the number of edges between neighbours of node i and \(K_{i}\) is the degree of node i.

The clustering coefficient for the whole network is the average of the local values \({\text{CCO}}_{i}\):

$${\text{CCO}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} {\text{CCO}}_{i}$$

where n is the number of nodes in the network [10].
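The two definitions above translate directly into code. In the sketch below the graph is assumed to be stored as a dictionary mapping each node to the set of its neighbours, and nodes with fewer than two neighbours are assigned a coefficient of 0; both are assumptions of the example, since the formula itself is undefined for \(K_{i} < 2\).

```python
def local_clustering(neighbours, i):
    """CCO_i = 2 * L_i / (K_i * (K_i - 1)); taken as 0 when K_i < 2 (assumption)."""
    k = len(neighbours[i])
    if k < 2:
        return 0.0
    nbrs = list(neighbours[i])
    # L_i: number of edges that connect two neighbours of node i
    links = sum(1 for a in range(k) for b in range(a + 1, k)
                if nbrs[b] in neighbours[nbrs[a]])
    return 2.0 * links / (k * (k - 1))


def average_clustering(neighbours):
    """CCO: mean of the local coefficients over all n nodes."""
    return sum(local_clustering(neighbours, i) for i in neighbours) / len(neighbours)
```

For a triangle {0, 1, 2} with a pendant node 3 attached to node 2, the local values are 1, 1, 1/3 and 0, so the network average is 7/12.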

Figure 1 shows the visualisation results of the clustering coefficient for each node in the four datasets. In this figure, colours of nodes correspond to the values of their clustering coefficients. As can be seen, there are some nodes that have high clustering coefficients, which indicates strong connectivity among their neighbours. In other words, they are more prone to be in the same cluster. Furthermore, the clustering coefficient for the considered networks is 0.4288, 0.3304, 0.3268 and 0.2487 for the Caltech36, Reed98, Haverford76 and Vassar85 datasets, respectively.

Fig. 1 Visualisation results of node clustering coefficient for subset of four datasets (colour figure online)

It is clear from the above discussion that the shared neighbours’ information can be used to describe the nature of connections between nodes in the network. This should motivate the use of shared neighbours’ information in detecting community clusters in the network.

3.2 Correlation of communities and attributes

To compute the correlation between the connectivity of nodes and their attributes, the nodes are first clustered by attribute, so that nodes with similar attribute values are grouped together into a cluster. In addition, four community clustering algorithms, namely Fast Modularity [37], Louvain [17], the leading eigenvector algorithm [38] and Walktrap [39], are applied to the datasets to find the communities. The correlations between the resulting communities and the attributes are then measured using the Jaccard similarity index, which was introduced by Paul Jaccard [40].
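The text does not spell out how the Jaccard index is applied to two groupings of the same nodes. One common reading, assumed here purely for illustration, is the pair-counting form: the number of node pairs placed together by both groupings, divided by the number of pairs placed together by at least one of them. The function name `pair_jaccard` is hypothetical.

```python
from itertools import combinations


def pair_jaccard(labels_a, labels_b):
    """Pair-counting Jaccard index between two labelings of the same nodes."""
    both = only_a = only_b = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1          # pair grouped together in both labelings
        elif same_a:
            only_a += 1        # together in the first labeling only
        elif same_b:
            only_b += 1        # together in the second labeling only
    denom = both + only_a + only_b
    # Assumption: if neither labeling groups any pair, treat them as identical.
    return both / denom if denom else 1.0
```

Here `labels_a` could hold the community label of each node and `labels_b` the value of one attribute (for example, year) for the same nodes.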

Figure 2 presents the Jaccard similarity index for four different community detection algorithms with each attribute over the four networks in the Facebook dataset. It is interesting to notice that, for the same dataset, the order of the correlation strength across different attributes is not the same and varies from one community clustering algorithm to another. For example, in the Reed98 dataset, the Fast Modularity algorithm agrees most strongly with the attribute ‘student faculty’, whereas the Louvain algorithm agrees most strongly with ‘year’. This is because the algorithms differ in how they treat the nodes and assign them to communities, and therefore produce communities of different size and number.

Fig. 2 Agreement of different community detection algorithms with each attribute, for a subset of four datasets

Even though there exists a difference in attribute ranking across different algorithms and datasets, as an overview, the most agreements are observed with student faculty, gender, year and dormitory attributes. However, in computing the correlation between attributes and community structure, Traud and Kelsic [25] reported that the order of correlation strength is significantly dependent on the agreement index used and not consistent across different indices.

Observing the correlation between the attributes and the communities in the network indicates that the attribute information is a source of data that can be used to perform the community clustering task. Furthermore, based on the homophily property of a network, as shown above, it is clear that linked nodes are more likely to share similar attributes. However, the attributes do not all have the same influence on the community structure, and some attributes weigh more than others. Thus, the impact of different attributes on communities needs to be known, and the attributes need to be weighted according to their influence on the community structure. This will balance the role of network information and node attributes.

4 The proposed optimisation approach

The proposed approach could be defined as a pre-processing phase for conventional community clustering algorithms, which takes a graph G = (V, E, A), the weight of attributes (W) and two more weighting factors (α and β) as inputs. α weights the contribution of connectivity information against that of attribute and shared neighbours’ information combined. β weights attribute information against shared neighbours’ information. These weighting factors (W, α, β) can be either provided as part of the input if they are known a priori or calculated from the dataset.

The proposed approach returns a hybrid similarity matrix. The hybrid similarity matrix is a weighted combination of attribute information, shared neighbours’ information and connectivity information between the nodes. Once the proposed approach constructs the hybrid similarity matrix, it can be integrated with any of the state-of-the-art clustering algorithms proposed for weighted graph (e.g. Newman fast greedy algorithm, Louvain algorithm, Newman algorithm based on leading eigenvector of a modularity matrix or Walktrap algorithm) to extract optimum community clusters.

4.1 General architecture

The general architecture of the proposed approach is shown in Fig. 3. As can be seen in the figure, the approach has two phases, namely the parameter learning phase and information aggregation phase. The aim of the first phase is to extract optimal parameters, whereas the second one is used to build a hybrid similarity matrix.

Fig. 3 System architecture for the proposed approach

We formally describe the generative process of hybrid similarity matrix as the following:

$$H_{\text{sim}} \left( {i,j} \right) = \alpha A\left( {i,j} \right) + \left( {1 - \alpha } \right)\left[ {\beta Wa_{\text{sim}} \left( {i,j} \right) + \left( {1 - \beta } \right) SN_{\text{sim}} \left( {i,j} \right)} \right]$$
$$Wa_{\text{sim}} \left( {i,j} \right) = W A_{\text{sim}} \left( {i,j} \right)$$

where \(H_{\text{sim }} \left( {i,j} \right)\): hybrid similarity matrix, A: adjacency matrix (matrix representation of exactly which nodes in the network contain edges between them), \(Wa_{\text{sim}} \left( {i,j} \right):\) the weighted attribute similarity between a pair of nodes (i, j), α: the weighting factor used for the contribution of connectivity information to the attribute information and shared neighbours information, β: the weighting factor used for the contribution of attribute information to the number of common neighbours information, \(SN_{\text{sim}} \left( {i,j} \right)\): shared neighbours similarity between nodes i and j, \(A_{\text{sim}} \left( {i,j} \right)\): the attribute similarity between a pair of nodes (i, j) in network G = (V, E, A), and W: a matrix containing the weights of each attribute of the node in the network.
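The weighted combination defined above can be sketched as follows. The example assumes that the three ingredient matrices have already been computed as n × n lists of lists and that α and β lie in [0, 1], which the convex-combination form suggests but the text does not state explicitly. The function name `hybrid_similarity` is hypothetical.

```python
def hybrid_similarity(adjacency, wa_sim, sn_sim, alpha, beta):
    """H(i,j) = alpha*A(i,j) + (1 - alpha)*(beta*Wa(i,j) + (1 - beta)*SN(i,j))."""
    if not (0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("alpha and beta are assumed to lie in [0, 1]")
    n = len(adjacency)
    return [[alpha * adjacency[i][j]
             + (1 - alpha) * (beta * wa_sim[i][j] + (1 - beta) * sn_sim[i][j])
             for j in range(n)] for i in range(n)]
```

With α = 1 the result reduces to the adjacency matrix, so only link information is used; with α = 0 and β = 1 only the weighted attribute similarity remains. Intermediate values trade the three sources off against one another.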

Definition 3 shared neighbours

Given a graph G = (V, E), for a node i ∈ V, the neighbours of node i are the nodes that directly connect to node i and are denoted by \(\Gamma\)(i).

The shared neighbours of node i and j are the nodes that both directly connect to nodes i and j. It is defined as:

$$SN\left( {i,j} \right) = \left\{ {\Gamma \left( i \right) \cap \Gamma \left( j \right)} \right\} .$$

The shared neighbours similarity between nodes i and j is calculated by dividing the number of shared neighbours between them by the maximum degree of i and j nodes. It is defined as:

$$SN_{\text{sim}} \left( {i,j} \right) = \frac{{\left| {SN\left( {i,j} \right)} \right| }}{{\hbox{max} \left[ {K_{i} ,K_{j} } \right]}}$$

where \(SN\left( {i,j} \right)\): shared neighbours between nodes i and j and \(K_{i}\): degree of node i
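A minimal sketch of this measure is given below, with the graph stored as a dictionary that maps each node to the set of its neighbours. Returning 0 when both nodes are isolated is an assumption of the example, since the ratio is otherwise undefined; the function name is hypothetical.

```python
def shared_neighbour_similarity(neighbours, i, j):
    """|Gamma(i) & Gamma(j)| / max(K_i, K_j); taken as 0 if both nodes are isolated."""
    largest_degree = max(len(neighbours[i]), len(neighbours[j]))
    if largest_degree == 0:
        return 0.0
    return len(neighbours[i] & neighbours[j]) / largest_degree
```

Dividing by the larger of the two degrees keeps the value in [0, 1] and penalises pairs in which one node has many neighbours that the other does not share.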

In the hybrid similarity matrix, as is defined in Eq. 4, the strength of relationship between nodes is determined by attribute information, connectivity information and shared neighbours and controlled by two weighting parameters (α and β). The α and β weighting parameters can be given as part of the input values by the human agent based on their knowledge of the data structure and their perception of the importance of each attribute. However, choosing the right weighting values of attributes without a priori knowledge of the network is a challenging task. Hence, the values of the attribute weighting factors (W) in the proposed approach need to be set carefully. In the following sections, the two phases of the proposed approach (the parameter learning phase and information aggregation phase) will be discussed in detail to provide guidelines on how to set these parameters.

4.2 The parameter learning phase

Since the goal of utilising details on attribute information, shared neighbours and connectivity information in this work is to get the best community clusters for the network, the attributes of the nodes should be weighted in such a way that greater weight is given to the more influential attributes and smaller weights for the less influential. Determining the influence and thus the weights of the attributes correctly will enhance the community structure algorithm and improve the detection of communities in the networks. The main purpose of the proposed attribute weighting technique is to search for small groups of nodes (initial clusters) that contain more internal connections (links between nodes in the group) than external connections (between nodes of the group and nodes in other groups) and then find the attribute similarity between nodes in the same groups to get the influence factor for each attribute.

To accomplish this, the parameter learning phase, as shown in Fig. 3, is subdivided into two stages: the local clustering stage and the attribute weighting stage. The local clustering stage extracts dense groups of nodes from the network to form the initial clusters. These initial clusters are small, local ones, far from being the optimal result, and are only used in the second stage to weight the attributes of each node in the network as well as to estimate the α and β parameter values.

In the local clustering stage, the initial clusters are obtained by applying the first phase of the DICCA algorithm proposed by the authors in [41], which is named the local clustering phase. The basic idea of the local clustering phase in DICCA consists of picking m nodes as originators, spread out across the entire region of the network, and assigning each node to the closest originator to form a cluster.
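The assignment step can be pictured with a simple stand-in. The sketch below is not the DICCA procedure of [41]: it assumes that the originators have already been chosen, that "closest" means the smallest hop count, and that ties are broken by the order in which originators are listed. The function name `assign_to_originators` is hypothetical.

```python
from collections import deque


def assign_to_originators(neighbours, originators):
    """Multi-source breadth-first search: label each reachable node with its nearest originator."""
    owner = {o: o for o in originators}
    queue = deque(originators)
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in owner:           # first visit corresponds to the smallest hop count
                owner[nxt] = owner[node]
                queue.append(nxt)
    return owner                           # nodes unreachable from any originator stay unassigned
```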

The attribute weighting stage is then applied to find the strength of the weighting for each attribute based on the structure of the current clustering results. During the attribute weighting stage, each attribute is weighted according to its influence on the community structure: highly influential attributes are assigned high weights, while less influential attributes are assigned low weights.

In order to find the attribute weighting, it is necessary to measure the proximity between pairs of nodes in the initial clusters based on their attributes. To do so, the attribute similarity metric needs to be defined first.

4.2.1 Attribute similarity metric

The attribute similarity between nodes Vi and Vj within the same cluster is determined by examining each of the d attributes of the two nodes, and it reflects the strength of the relationship between them in terms of their attribute values.

It must be emphasised that irrespective of the similarity metric considered to find the weight of attributes, first, the similarity between the attribute values of each pair of nodes belonging to the same local cluster needs to be calculated. The procedure is as follows:

Let \(X_{N.d}^{i}\) be the similarity matrix for cluster i with N nodes, each with d attributes. The local attribute weight for cluster i is obtained by summing the rows of this matrix (one row per node) attribute by attribute and dividing by N, which gives a vector of size 1 × d:

$${\text{LW}}_{d}^{i} = \frac{1}{N}\mathop \sum \limits_{n = 1}^{N} X_{n.d}^{i} ,$$

where \(X_{n.d}^{i}\) denotes the row of \(X_{N.d}^{i}\) that belongs to node n.

The weighting for the entire network is then calculated by adding the corresponding attribute of each local attribute weight (sum of the vectors) to form another vector in 1 × d size. It is formally defined as:

$$W = \frac{1}{m}\left( {\mathop \sum \limits_{i = 1}^{m} {\text{LW}}_{d}^{i} } \right)$$

where \({\text{LW}}_{d}^{i}\) is the local attribute weight for cluster i, m is the number of initial clusters and W is the vector of attribute weights for the network.

It is worth mentioning that the weights assigned to the attributes in the parameter learning phase \({\text{LW}} = \{ Lw_{1} ,Lw_{2} \ldots Lw_{m} \}\) range between 0 and 1.
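The two averaging steps can be sketched as follows. The sketch assumes categorical attributes and takes row n of the cluster similarity matrix to be, for each attribute, the fraction of the other nodes in the cluster that share node n's value. The text does not spell out this construction, so it is an assumption, and the function names are hypothetical.

```python
import numpy as np

def local_similarity_matrix(attrs):
    """Build an (N, d) similarity matrix for one cluster.

    attrs: (N, d) array of categorical attribute values.
    Entry [n, l] is the fraction of the other nodes in the cluster that share
    node n's value of attribute l (an assumed construction for illustration).
    """
    attrs = np.asarray(attrs)
    n_nodes = attrs.shape[0]
    if n_nodes < 2:
        return np.zeros(attrs.shape, dtype=float)
    # matches[a, b, l] is True when nodes a and b agree on attribute l.
    matches = attrs[:, None, :] == attrs[None, :, :]
    # Subtract 1 to discard each node's match with itself.
    return (matches.sum(axis=1) - 1) / (n_nodes - 1)

def local_weight(attrs):
    """Local attribute weight: average the similarity matrix over the N nodes."""
    return local_similarity_matrix(attrs).mean(axis=0)

def global_weight(clusters):
    """Network-wide weight W: average the local weights over the m clusters."""
    return np.mean([local_weight(c) for c in clusters], axis=0)

cluster_a = np.array([[1, 0], [1, 1], [1, 2]])          # attribute 0 shared by all
cluster_b = np.array([[2, 7], [2, 7], [3, 7], [2, 8]])  # both partly shared
W = global_weight([cluster_a, cluster_b])
```

Because every entry of the similarity matrix lies between 0 and 1, the resulting local and global weights stay in that range as well.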

Whether or not a certain subset is optimal depends on the similarity metric employed. The question of which similarity measures are best suited to different types of attribute data is beyond the scope of this work. In this work, the Jaccard similarity coefficient is used to define the attribute similarity between nodes in the same cluster and to find the weight of attributes (W) during the parameter learning phase. For an overview of the research work on determining the most meaningful similarity measures in various fields and for different types of data, see [42, 43].

Definition 4 Jaccard similarity

Given a network G = (V, E, A), for any pair of nodes Vi, Vj ∈ V, the Jaccard similarity between nodes Vi and Vj with respect to their attributes is denoted J(Ai, Aj) and is defined as the size of the intersection divided by the size of the union of the two attribute sets, as given below [44]:

$$J\left( {A_{i} , A_{j} } \right) = \frac{{\left| {A_{i} \cap A_{j} } \right|}}{{\left| {A_{i} \cup A_{j} } \right|}}$$

where \(J(A_{i} ,A_{j} )\) returns a value between 0 and 1, with 0 denoting no similarity and 1 denoting identical sets.
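As a small illustration, the definition can be written directly with Python sets. Returning 0 when both sets are empty is a convention chosen for this sketch; the definition above does not cover that case.

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two collections of values."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both sets are empty; returning 0.0 is an assumption of this sketch.
        return 0.0
    return len(a & b) / len(union)

partial = jaccard({"student", "2008"}, {"student", "2009"})   # one of three shared
identical = jaccard({"student", "2008"}, {"student", "2008"})
disjoint = jaccard({"student"}, {"faculty"})
```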

Furthermore, since Jaccard similarity is used in this work to measure attribute similarity between nodes, \(X_{N.d}^{i}\) can be defined as the Jaccard similarity matrix for cluster i. The weighted attribute similarity \(Wa_{\text{sim}} \left( {i,j} \right)\) between any two nodes i and j is defined as follows:

$$Wa_{\text{sim}} \left( {i,j} \right) = \frac{{\mathop \sum \nolimits_{L = 1}^{d} (W_{L} *\left[ {{\text{Att}}\_i_{L} \cap {\text{Att}}\_j_{L} } \right])}}{{ \mathop \sum \nolimits_{L = 1}^{d} (W_{L} *\left[ { {\text{Att}}\_i_{L} \cup {\text{Att}}\_j_{L} } \right] )}}$$

where each node has d attributes and \({\text{Att}}\_i\) is the attribute vector of node i.
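The weighted similarity can be sketched under one possible reading of the bracketed terms. The sketch assumes each attribute holds a single categorical value, so that for attribute L the intersection term is 1 when the two values agree and 0 otherwise, and the union term is 1 when they agree and 2 otherwise. This reading, and the function name, are assumptions made for illustration.

```python
def weighted_attribute_similarity(att_i, att_j, weights):
    """Weighted Jaccard-style similarity between two attribute vectors.

    att_i, att_j: sequences of d categorical attribute values.
    weights: sequence of d attribute weights W_L.
    Each attribute is treated as a singleton set (assumption of this sketch).
    """
    if not (len(att_i) == len(att_j) == len(weights)):
        raise ValueError("attribute vectors and weights must have equal length")
    numerator = 0.0
    denominator = 0.0
    for value_i, value_j, weight in zip(att_i, att_j, weights):
        same = value_i == value_j
        numerator += weight * (1 if same else 0)    # weighted intersection size
        denominator += weight * (1 if same else 2)  # weighted union size
    return numerator / denominator if denominator > 0 else 0.0

weights = [0.5, 0.3, 0.2]
node_i = ["student", "F", 2008]
node_j = ["student", "F", 2009]
similarity = weighted_attribute_similarity(node_i, node_j, weights)
```

In this toy case the two nodes agree on the two most heavily weighted attributes, giving 0.8 / 1.2, or about 0.67. A disagreement on a heavily weighted attribute would lower the score more than a disagreement on a lightly weighted one.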

The pseudo-code outlining the entire procedure with Jaccard similarity is listed in Algorithm 1.

Effect of α and β on the quality of community structure

When selecting the values of the two weighting factors (α and β), the intended emphasis on each source of network information needs to be considered. For example, emphasis on the connectivity information means that α should be greater than 0.5, whereas emphasis on the attribute and shared neighbours information means that α should be less than 0.5. The same argument holds for β: a value of β greater than 0.5 indicates that the node attribute information contributes more than the information on the number of common neighbours. The appropriate weighted combination of attribute information, shared neighbours and connectivity information is not the same for every network, so the values of α and β need to be selected carefully. However, in practice, without any prior domain knowledge, it is quite difficult to scale the contribution of each source of information.

In order to determine the effects of varying α and β parameters on the quality of community clustering and thereby to determine the parameters’ selection range, four different datasets are used to track how the community clustering changes when the values of α and β are varied from 0.1 to 1 with a step size of 0.1. Also, modularity index is used to evaluate the quality of community detection.

Figure 4 shows how the two parameters influence the community clustering quality. The X-axis and Y-axis in the figures represent the values of α and β, respectively, while the Z-axis represents the modularity score. As can be clearly seen from Fig. 4a–d, the modularity is remarkably robust to the choice of parameter values. When α = β = 0, the modularity of community detection is ≥ 0.25 for most of the algorithms for all the datasets. However, it is worth mentioning that α = β = 0 indicates that the information used to find the community clustering is based only on the number of common neighbours (\(H_{sim} \left( {i,j} \right) = SN_{sim} \left( {i,j} \right)\)).

Fig. 4 Modularity values achieved by the four community clustering algorithms using different values of α and β on: a Caltech36, b Reed98, c Haverford76, d Vassar85 datasets

Fig. 5 Attribute weights for four datasets

As an overview, with an increasing value of β, the quality of community clustering decreases for a constant value of α. On the contrary, with an increasing value of α, the quality of community clustering increases slightly for a constant β value. It is also noticed that for values of α < 0.6 the modularity is dramatically affected by varying the value of β. The modularity fluctuates between 0.01 and 0.4, and it becomes relatively stable when α value ranges between 0.6 and 0.7. However, the modularity becomes almost stable for the vast majority of β values when α\(>\) 0.7.

Experimental results also demonstrate that the connectivity information is more useful than the shared neighbours’ information and attribute information. Therefore, the value selected for α should be greater than or equal to 0.5. For the datasets considered in this work, high modularity values are obtained when α\(>\) 0.7.

With regard to these two parameters α and β, there is no straightforward way to fit them to datasets and different datasets may require different parameter values. However, based on the above argument, in order to better exploit the sources of information and obtain optimum robustness in the detection of community clusters in the presence of noise, the value of α is set based on the weights of attributes (w) as follows:

$$\alpha = {\text{avg}}\left( w \right).$$

In this work, to avoid a cumbersome decision process, equal importance is given to shared neighbours and attribute information; that is, β = 0.5 is used in all the experiments that follow.

4.3 Information aggregation phase

The information aggregation phase aims to build a weighted matrix, named the hybrid matrix, based on the knowledge learned in the parameter learning phase. The attribute weights w and the α and β values are used to build the hybrid similarity matrix as defined in Eq. 4. In the hybrid matrix, edges linking nodes that do not have similar attributes or shared neighbours are penalised and assigned low weights, while edges connecting similar nodes or nodes with shared neighbours are assigned high weights. In addition, some edges are added between nodes to represent attribute and shared neighbour similarity.
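A possible form of the aggregation is sketched below. Eq. 4 is not reproduced in this section, so the combination used here, H = α·A + (1 − α)·(β·Wa_sim + (1 − β)·SN_sim), is an assumption. It was chosen only because it agrees with the statements made about the parameters: α weights connectivity, β weights attributes against shared neighbours, and α = β = 0 reduces H to SN_sim. The shared neighbours matrix is taken as a given input, and all names and example values are hypothetical.

```python
import numpy as np

def hybrid_matrix(adjacency, attr_sim, sn_sim, alpha, beta=0.5):
    """Combine three n x n matrices into one weighted similarity matrix.

    adjacency: 0/1 connectivity matrix A.
    attr_sim:  weighted attribute similarity matrix (Wa_sim).
    sn_sim:    shared neighbours similarity matrix (SN_sim), assumed given.
    The linear form below is an assumed stand-in for Eq. 4.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    attr_sim = np.asarray(attr_sim, dtype=float)
    sn_sim = np.asarray(sn_sim, dtype=float)
    return alpha * adjacency + (1 - alpha) * (beta * attr_sim + (1 - beta) * sn_sim)

w = np.array([0.9, 0.7])         # illustrative attribute weights
alpha = float(np.mean(w))        # alpha = avg(w)
A = np.array([[0, 1], [1, 0]])
Wa = np.array([[1.0, 0.5], [0.5, 1.0]])
SN = np.array([[1.0, 0.0], [0.0, 1.0]])
H = hybrid_matrix(A, Wa, SN, alpha, beta=0.5)
```

With these toy inputs the single edge receives the weight 0.8·1 + 0.2·(0.5·0.5 + 0.5·0) = 0.85. Under this assumed form, a pair of nodes with no edge but similar attributes or shared neighbours still obtains a non-zero entry, which corresponds to the added edges mentioned above.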

5 Experimentation and results

5.1 Experimental setup

In order to assess the effectiveness of the proposed approach in detecting communities under an unreliable network structure, experiments have been conducted on four different Facebook network datasets in which some edges are missing while the node attributes are fully available. For the sake of evaluation, edges are removed from the network at random, and the number of removed links is increased from zero to half the number of edges in the network in steps of 5% of the network edges.
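The edge removal protocol can be sketched as follows. The sketch assumes that each noise level is drawn independently from the complete edge list; the text does not say whether the removals are nested across levels. The function name and the toy edge list are hypothetical.

```python
import random

def remove_edges(edges, fraction, seed=None):
    """Return a copy of the edge list with a random fraction of edges removed."""
    rng = random.Random(seed)
    edges = list(edges)
    n_remove = round(fraction * len(edges))
    # Sample positions rather than edges so duplicate edges are handled safely.
    removed = set(rng.sample(range(len(edges)), n_remove))
    return [edge for idx, edge in enumerate(edges) if idx not in removed]

# Toy ring network with 20 edges.
ring = [(i, (i + 1) % 20) for i in range(20)]

# Noise levels 0%, 5%, ..., 50% of the edges.
fractions = [step / 20 for step in range(11)]
noisy_networks = [remove_edges(ring, f, seed=step) for step, f in enumerate(fractions)]
remaining = [len(network) for network in noisy_networks]
```

Each noisy edge list would then be passed to the community detection algorithm, with and without the pre-processing step, and the results averaged over repeated runs.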

In each experiment, the performance is computed using the results obtained by applying each of the four algorithms with and without applying the proposed approach as a pre-processing step. Each algorithm has been applied more than once on the data, and the experimental results presented are the average of ten simulation runs.

To quantify the performance of the proposed approach, the quality of the obtained community structures is evaluated based on the modularity, number and size of detected communities.

The performance of the proposed approach is evaluated in terms of repeatability and reproducibility when noise is introduced in the environment. This is measured by its ability to find the same ground truth communities detected under normal conditions even when noise is introduced. The outcome of the community clustering solution obtained by each algorithm with the original dataset (complete dataset) is used as a ground truth and compared against the outcome of the clustering solutions when a number of edges are progressively removed from zero to half the number of total edges in the network.

Moreover, for simplification, in the following sections the proposed approach is referred to as Hybrid-FA when combined with the Fast Modularity algorithm (FA), Hybrid-LA when combined with the Louvain algorithm (LA), Hybrid-LEA when combined with the leading eigenvector algorithm (LEA) and Hybrid-WA when combined with the Walktrap algorithm (WA). Additionally, to facilitate comparison of results in line charts, the results obtained using the proposed approach are drawn with a dashed line style and ‘x’ marker points.

It is worth mentioning that we have attempted to define and evaluate the computational complexity of this algorithm in [45]. While the exact mathematical model for the computational complexity of the pre-processing algorithm is harder to formalise, it could be represented using the computational model as \(\log \left( {nm} \right)^{2}\), in which n is the total number of nodes in the network and m the number of edges.

5.2 Experimental results and discussion

In this subsection, the effectiveness and efficiency of the algorithm are assessed from two aspects. One is to evaluate the attribute weighted method proposed in this work along with the methodology used to set the parameter value. The other aspect is to integrate the proposed approach with well-known community clustering algorithms and make a comparison of the results achieved without the integration to show how the proposed approach can be used to improve the robustness and quality of those well-known community clustering algorithms.

5.2.1 Evaluation of attribute weighting method

As highlighted in Sect. 3, different attributes have different significance for assessing the similarity between the nodes in the same community clusters; therefore, the attribute weighting method is proposed. In this section, the performance of the proposed attribute weighting method is experimentally evaluated.

The evaluation checks how well the attribute weights (W) obtained by the weighting method match the attributes that are actually important, which are presented in Fig. 2.

Figure 5 shows the attribute weights obtained by the weighting method for the four datasets under consideration. The attributes have different weights and a different order of importance for different datasets. However, looking at the attribute weights of the four datasets, it is clear that four specific attributes (student, gender, dormitory and year) have the highest weighting values across all four datasets. In contrast, the remaining attributes (high school and major/minor) do not have a strong influence on the community structure and are therefore given very small weights in the attribute weighting stage.

Moreover, the comparison between Figs. 2 and 5 shows that the parameter learning phase achieves almost the same results in most cases: the order of attribute importance is either the same or only slightly different, owing to small differences in the attribute correlation. For example, in the Caltech36 dataset, the attributes in order of importance are student, gender, year and house, with attribute weight values of 0.4695, 0.3102, 0.2195 and 0.2193, respectively. In Fig. 2, taking the Fast Modularity algorithm as an example, the order changes to student, gender, house and year, with Jaccard index values of 0.2772, 0.2412, 0.1746 and 0.1239, respectively.

Furthermore, to evaluate the performance of the proposed weighting method in handling noisy data, Fig. 6 shows the attribute weights of the four most heavily weighted attributes obtained by the weighting method when the percentage of removed edges is varied from 0 to 50%. From the figure, it is worth noting that the ordering of the weights is remarkably stable: the attribute weighting method copes effectively with noisy data and correctly weights the attributes according to their importance.

Fig. 6 Robustness of weighting method to the edge removal

To further assess the parameter learning phase, the number of initial clusters identified at the local clustering stage, along with the value of α, is reported against the percentage of removed edges for the four datasets in Table 1.

Table 1 Results for four datasets

The results in Table 1 indicate that the noise has no significant influence on the value of α. In other words, the method used to define the α value (see Eq. 12) is fairly stable. In addition, it is clear that local clustering tends to partition the data into a larger number of initial clusters as more edges are removed. Considering the Reed98 dataset, for example, when the missing edges varied from 0 to 50%, the value of α and the number of obtained initial clusters changed from {0.808, 382} to {0.823, 446}.

It is also worth noting from Table 1 that the value of α is not related to the number of initial clusters found by the local clustering stage. In some cases, a higher value of α is obtained when more initial clusters are found. In others, however, the value of α increases when fewer initial clusters are found. Considering the Reed98 dataset, for instance, when the missing edges increased from 15 to 20%, both the α value and the number of initial clusters increased, from {0.814, 399} to {0.816, 405}. On the other hand, for the same dataset, when the missing edges increased from 5 to 10%, the value of α increased from 0.812 to 0.813 while the number of initial clusters decreased by 3. However, the value of α for the four considered datasets is always higher than 0.75. This value is in agreement with what was observed in the section on the effect of α and β on the quality of community structure, where the connectivity information was found to contain more useful information than the shared neighbours or attribute information (α ≥ 0.5) and high modularity was obtained for values of α higher than 0.7.

Overall, the results clearly demonstrate that the parameter learning method has the ability to extract essential and informative attributes and to weight them to reflect the relative importance of attribute in community clustering tasks.

5.2.2 Model performance

In this subsection, using the optimal parameters determined using the parameter learning phase (as discussed in Sect. 4.1), the performance of the pre-processing approach is evaluated.

Number of community clusters

Since the number of communities in the networks is unspecified, the algorithms try to automatically detect the most appropriate number of communities by maximising the modularity. The variation in the number of community clusters when different numbers of edges are removed is shown in Fig. 7. It is observed that the conventional algorithms are adversely affected by noise and therefore fail to identify appropriate community structures. Moreover, in most cases the number of communities increases with each additional 5% of missing edges. The only exception is the LEA algorithm, which results in almost the same number of communities even without applying the pre-processing approach.

Fig. 7 Number of community clusters for: a Caltech36 university dataset, b Reed98 university dataset, c Haverford76 university dataset, d Vassar85 dataset

Considering the Caltech36 dataset, for example, as an increasing proportion of edges is randomly removed from the network (from 0 to 50%), the number of communities detected by the conventional algorithms changes from {10,10,12,72} to {39,39,10,104} for the {FA, LA, LEA, WA} algorithms, respectively. Such behaviour can be explained by the fact that the conventional algorithms consider only topology information. On the other hand, the proposed approach considers attribute, shared neighbours and connectivity information. Since the nodes in the same community are usually not just highly connected but also have similar attributes and transitivity coefficients, the proposed approach uses attribute information to make up for the missing link information and to identify the community membership. Consequently, integrating the proposed approach with a conventional algorithm is more advantageous for discovering the most appropriate number of community structures than using the conventional algorithm on its own.

Walktrap algorithm when run on the dataset on its own failed to detect the appropriate number of communities, and compared to the other algorithms, the number of communities returned by Walktrap is extremely high for all considered datasets. However, applying the proposed approach as a pre-processing step to build the hybrid similarity matrix before applying the Walktrap community detection algorithm has significantly improved the performance to obtain just 8 clusters.

Furthermore, when the percentage of removed edges is increased from 0 to 50%, the number of clusters formed using the proposed approach remains closer to that of the original network partition obtained when no noise is applied. For example, in the case of the Caltech36 dataset when 50% of edges are missing, the number of obtained communities is {4,8,8,4} for the {Hybrid-FA, Hybrid-LA, Hybrid-LEA, Hybrid-WA} algorithms, respectively. This demonstrates that the proposed approach has the capability to extract relevant information from highly noisy datasets and make these algorithms quite robust to edge removal.

Size of community clusters

To take a closer look at the sensitivity of the obtained communities to the noise, the average size of the obtained communities, when percentage of removed edges is increased from 0 to 50%, is investigated and shown in Fig. 8.

Fig. 8 Average community size for: a Caltech36 university dataset, b Reed98 university dataset, c Haverford76 university dataset, d Vassar85 dataset

Considering the Vassar85 dataset, for example, as the proportion of edges randomly removed from the network increases (from 0 to 50%), the average community size detected by the conventional algorithms changes from {614, 511, 438, 51} to {94, 95, 583, 28} for the {FA, LA, LEA, WA} algorithms, respectively. In contrast, combining the proposed pre-processing approach with the community clustering algorithms considered in this work results in community clusters with an almost constant average size. This effect arises because conventional community identification is based only on the adjacency matrix, so the number of community clusters obtained depends heavily on the number of links in the network; as the percentage of missing edges increases, the clustering algorithm becomes less stable and the clusters become smaller. This is not the case for the hybrid similarity matrix, which is based on several sources of information (attribute information, shared neighbours information and connectivity between nodes in the network).

Modularity

Regarding the quality of community clusters, the modularity metric is used as a scoring function to assess the quality of detected community clusters with and without applying the proposed pre-processing phase. Figure 9 shows the averaged Q values, plotted for each community detection algorithm. As shown in this figure, in most cases using the proposed pre-processing approach has resulted in a slightly lower modularity than the conventional community detection methods. However, the difference is negligible and the results suggest that the proposed approach is a promising and powerful tool to assist in the fine tuning of different sources of information in community clustering area.
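For reference, the standard Newman modularity of a partition can be computed community by community as Q = Σ_c [L_c / m − (d_c / 2m)²], where m is the number of edges, L_c the number of edges inside community c and d_c the total degree of its nodes. The following minimal sketch assumes an undirected, unweighted network without self-loops; how the weights of the hybrid matrix enter the score in the experiments is not modelled here, and the function name is hypothetical.

```python
def modularity(edges, membership):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    membership: dict mapping node -> community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = {}    # L_c: number of edges with both ends in community c
    degree_sum = {}  # d_c: total degree of the nodes in community c
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}
q_split = modularity(edges, two_groups)
q_merged = modularity(edges, one_group)
```

Splitting the toy network into its two triangles gives Q = 5/14 ≈ 0.357, whereas placing all nodes in one community gives Q = 0.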

Fig. 9 Modularity index vs missing edges for: a Caltech36 university dataset, b Reed98 university dataset, c Haverford76 university dataset, d Vassar85 dataset

Moreover, the comparison between Figs. 7, 8 and 9 shows that while the approach achieves a good modularity quality that is comparable with the conventional methods, the approach is significantly more effective in terms of both number and size of communities detected where the network structure is found to have some unreliable or missing information.

Table 2 shows the overall performance results of the proposed method using different types of source information.

Table 2 Performance comparison of the proposed approach using different types of information

6 Conclusion and future work

In this paper, an optimisation tool for the existing community detection algorithms is proposed. This tool could be used as a pre-processing stage that makes use of attribute information, shared neighbours and connectivity information aspects of the network to build a hybrid similarity matrix. Because the attributes in a network usually do not play equally important roles in clustering tasks, the proposed approach assigns a weighting value to each attribute during the process of building hybrid similarity matrix to reflect the relative importance of each attribute.

Besides the attribute weighting parameter, the approach required the specification of two more parameters α and β; these control the degree of contribution of connectivity information, attribute similarity and shared neighbours information for a good balance between them. The sensitivity of the pre-processing approach to α and β parameters is analysed. In addition, a simple but effective model for determining attribute weighting value and α and β values of the approach to achieve an optimal result is provided.

A Jaccard similarity coefficient is used to denote attribute similarity between nodes and is combined with the adjacency matrix (link information). The approach is tested in conjunction with four traditional algorithms popular in the literature (the Newman greedy algorithm, the Louvain greedy algorithm, Newman spectral optimisation and Walktrap) by applying them to four real-life Facebook data networks. Experimental results demonstrate that this approach yields better effectiveness and robustness than the state-of-the-art algorithms over noisy networks.

The proposed approach utilises a similarity function for comparing attributes. In a wide range of real-life applications, data contain mixed types of attributes (e.g. numerical, categorical). Therefore, it is important to use appropriate similarity metrics to correctly measure the attribute proximity between two nodes in the network. However, the appropriate choice of similarity measure depends on the attribute types of the network under study. An interesting direction in which to extend this research work is to use a more sophisticated approach that supports datasets with mixed attribute types. Furthermore, we have already developed a set of ‘decentralised algorithms’ for community clustering. We will be evaluating these algorithms with the pre-processing scheme proposed in this paper. We will also explore using the smartphone datasets (3.3 TB) collected by the University of Cambridge using Device Analyzer.