1 Introduction: Previous Work and Motivation

Community detection is a popular field of data science with applications ranging from sociology to biology to computer science. Recently, this concept was extended from flat and weighted networks to networks with a feature space associated with their nodes. A community is a group, or cluster, of densely interconnected nodes that are also similar in the feature space. A number of papers have proposed approaches to identifying communities in feature-rich networks (see recent reviews in [8] and [3]). They naturally fall into three groups: (a) those heuristically transforming the feature data to augment the network format, (b) those heuristically converting the data to the features-only format, and (c) those involving, usually, a probabilistic model of the phenomenon, so that its parameters can be estimated by the maximum likelihood principle. A typical method within approach (a) or (b) combines several heuristics, thus involving a number of unsubstantiated parameters which are difficult to systematize, let alone test. The most interesting approaches in the modeling group (c) are represented by the methods in [21] and [16]: the former statistically models the inter-relation between the network structure and node attributes; the latter involves Bayesian inference.

Our approach also belongs to the modeling group, except that we model the data rather than the process of data generation. Specifically, our data-driven model assumes a hidden partition of the node set into non-overlapping communities, together with parameters encoding the average within-community link intensity and the feature-space central points. To find this partition and these parameters, we apply a combined least-squares criterion measuring how well the partition and parameters recover the data. We propose a greedy procedure for finding clusters one by one, a strategy that has already proved successful for feature-only data and for network/similarity-only data [2, 12]. In contrast to other approaches, ours is applicable to mixed-scale data after categories are converted into 1/0 dummy variables treated as quantitative ones. Our experiments show that this approach is valid and competitive against state-of-the-art approaches.

The rest of the paper is organized as follows. We describe our model and algorithm in Sect. 2. In Sect. 3, we describe the setting of our experiments. In Sect. 4, we present the results of the experiments that validate our method and compare it with the competition. We draw conclusions in Sect. 5.

2 A Least Squares Criterion

Let us consider a dataset represented by two matrices: a symmetric \( N\times N \) network adjacency matrix \(P=(p_{ij})\), whose entries \(p_{ij}\) may be any reals, and an \(N \times V\) entity-to-feature matrix \(Y=(y_{iv})\), \(i\in I\), where I is an N-element entity set.

We assume that there is a partition \(S=\{S_1, S_2,..., S_K\}\) of I into K non-overlapping communities, a.k.a. clusters, related to this dataset as described below.

Denote the binary membership vector of the k-th cluster by \(s_k =(s_{ik})\), \(k=1,2,..., K\), so that its i-th component equals unity for \(i \in S_k\) and zero otherwise. Each cluster is assigned a V-dimensional center vector \(c_k= (c_{k v})\). Also, there is a positive network intensity weight of the k-th cluster, denoted by \(\lambda _k\), to adjust the binary \(s_{ik}\) values to the measurement scale of the network adjacency matrix P.

Our model is expressed by Eqs. (1) and (2) below:

$$\begin{aligned} y_{iv} = \sum _{k=1}^{K} s_{ik}c_{kv} + f_{iv}, i\in I, v\in V, \end{aligned}$$
(1)
$$\begin{aligned} p_{i j} = \sum _{k=1}^K \lambda _{k} s_{ik} s_{jk} + e_{ij}, i,j\in I. \end{aligned}$$
(2)

Here the residuals \(e_{ij}\) and \(f_{iv}\) should be made as small as possible.

According to the least-squares principle, the “right” membership vectors \(s_k\), community centers \(c_k\), and intensity weights \(\lambda _k\) are the minimizers of the summary least-squares criterion:

$$\begin{aligned} F(\lambda _k, s_{k}, c_{k}) = \rho \sum _{k=1}^{K} \sum _{i, v} (y_{iv} - c_{k v} s_{i k})^2 + \xi \sum _{k=1}^{K} \sum _{i, j}( p_{i j} -\lambda _{k} s_{ik} s_{jk})^2 \end{aligned}$$
(3)

The factors \(\rho \) and \(\xi \) in Eq. (3) are expert-driven constants to balance the two sources of data.

At first glance, the criterion in Eq. (3) differs from what follows from Eqs. (1) and (2): the summation over k stands outside the parentheses, whereas these equations require it to be inside. However, the formulation in (3) is consistent with the models in (1) and (2) because the vectors \(s_k\) (\(k=1, 2,..., K\)) correspond to a partition and thus are mutually orthogonal: for any specific \(i\in I\), \(s_{ik}\) is zero for all k except the one at which \(i\in S_k\). Therefore, each sum over k in Eqs. (1) and (2) consists of just one item, so the summation sign may indeed be moved outside the parentheses.

To use a one-by-one clustering strategy [13] here, let us denote an individual community by S; its center in feature space by c; and the corresponding intensity weight by \(\lambda \) (just dropping the index k for convenience). The extent of fit between the community and the dataset is the corresponding part of the criterion in (3):

$$\begin{aligned} F(\lambda , c_{v}, s_{i}) =\rho \sum _{i,v} (y_{iv} - c_{v} s_{i})^2 + \xi \sum _{i,j}( p_{ij} -\lambda s_{i}s_{j})^2 \end{aligned}$$
(4)

The problem: given the matrices \(P=(p_{ij})\) and \(Y=(y_{iv})\), find a binary vector s, as well as real-valued \(\lambda \) and \(c=(c_v)\), minimizing criterion (4).

As is well known, and, in fact, easy to prove, the optimal real-valued \(c_v\) is equal to the within-S mean of feature v, and the optimal intensity value \(\lambda \) is equal to the mean within-cluster link value:

$$\begin{aligned} c_{v} = \frac{\sum _{i\in S}y_{iv}}{|S|}; \ \lambda = \frac{\sum _{i,j\in S}p_{i j}}{|S|^2} \end{aligned}$$
(5)
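For a fixed membership vector s, the optimal values in Eq. (5) are simple means, as the following minimal numpy sketch illustrates (the function name is ours, for illustration only):

```python
import numpy as np

def optimal_center_and_intensity(Y, P, s):
    """Optimal c and lambda from Eq. (5), given a binary membership vector s."""
    S = np.flatnonzero(s)            # indices of the cluster S
    c = Y[S].mean(axis=0)            # within-S mean of every feature v
    lam = P[np.ix_(S, S)].mean()     # mean within-cluster link value
    return c, lam
```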

Criterion (4) can be further reformulated as:

$$\begin{aligned} \begin{aligned} F(s) = \rho \sum _{i,v} y_{iv}^2 -2 \rho \sum _{i,v} y_{iv} c_{v} s_{i} + \rho \sum _{v}c_{v}^2 \sum _{i} s_{i}^2 \ + \\ \xi \sum _{i,j} p_{i j}^2 - 2 \xi \lambda \sum _{i,j} p_{i j} s_{i} s_{j} + \xi \lambda ^2 \sum _{i} s_{i}^2 \sum _{j} s_{j}^2 \end{aligned} \end{aligned}$$
(6)

The items \(T(Y)=\sum _{i, v} y_{iv}^2\) and \(T(P)=\sum _{i, j} p_{ij}^{2} \) in (6) express the quadratic scatters of the data matrices Y and P, respectively. Using them, Eq. (6) can be reformulated as

$$\begin{aligned} F(s)=\rho T(Y) +\xi T(P) - G(s) \end{aligned}$$
(7)

where

$$\begin{aligned} G(s) = 2 \rho \sum _{i,v} y_{iv} c_{v} s_{i} -\rho \sum _{v}c_{v}^2 \sum _{i} s_{i}^2 + 2 \xi \lambda \sum _{i,j} p_{i j} s_{i} s_{j} - \xi \lambda ^2 \sum _{i} s_{i}^{2} \sum _{j} s_{j}^{2} \end{aligned}$$
(8)

Equation (7) shows that the combined data scatter, \(\rho T(Y) +\xi T(P)\), is decomposed into two complementary parts: one of them, F(s), expresses the residual, that is, the part of the data scatter which is minimized in Eqs. (1) and (2); the other, G(s), expresses the contribution of the model to the data scatter.

By substituting the optimal values of \(c_v\) and \(\lambda \) from (5) into this expression, we obtain a simpler expression for G(s):

$$\begin{aligned} G = \rho |S|\sum _{v} c_{v}^2 + \xi \lambda \sum _{i j} p_{i j} s_{i}s_{j} \end{aligned}$$
(9)

Maximizing G in (9) is equivalent to minimizing the criterion F in (4), because of (7).

One can see that maximizing the first item in (9) requires obtaining a large cluster (the greater the |S|, the better) which is as far away from the space origin, 0, as possible (the greater the squared distance \(\sum _v c_v^2\) from 0, the better). Usually the data are pre-processed so that the origin is shifted to the center of gravity, or grand mean, that is, the point whose components are the averages of the corresponding features. In such a case, putting the cluster as far away from 0 as possible means that the cluster should be anomalous. The second item in criterion (9) is proportional to the sum of within-cluster links multiplied by the average within-cluster link \(\lambda \). Maximizing criterion (9), thus, should produce a large anomalous cluster of high density.

We employ a greedy heuristic: starting from an arbitrary singleton \(S=\{i\}\), the seed, add entities one by one so that the increment of G in (9) is maximized. After each addition, recompute the optimal \(c_v\) and \(\lambda \). Halt when the increment becomes negative. After stopping, a last check is executed, the seed relevance check: remove the seed from the found cluster S; if the removal increases the cluster contribution, the seed is extracted from the cluster.
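As an illustration, here is a minimal Python sketch of this greedy addition procedure, assuming, for simplicity, equal balance factors \(\rho =\xi =1\) by default (the function and variable names are ours):

```python
import numpy as np

def contribution(Y, P, S, rho=1.0, xi=1.0):
    """Cluster contribution G of Eq. (9) for a list of node indices S."""
    c = Y[S].mean(axis=0)                        # optimal center, Eq. (5)
    lam = P[np.ix_(S, S)].mean()                 # optimal intensity, Eq. (5)
    return rho * len(S) * (c ** 2).sum() + xi * lam * P[np.ix_(S, S)].sum()

def fnac(Y, P, seed, rho=1.0, xi=1.0):
    """Grow one cluster from a seed, adding the entity that raises G the most."""
    S = [seed]
    G = contribution(Y, P, S, rho, xi)
    candidates = set(range(Y.shape[0])) - {seed}
    while candidates:
        best = max(candidates, key=lambda i: contribution(Y, P, S + [i], rho, xi))
        G_best = contribution(Y, P, S + [best], rho, xi)
        if G_best <= G:                          # halt once the increment is not positive
            break
        S.append(best)
        candidates.remove(best)
        G = G_best
    # seed relevance check: drop the seed if that increases the contribution
    if len(S) > 1:
        rest = [i for i in S if i != seed]
        if contribution(Y, P, rest, rho, xi) > G:
            S, G = rest, contribution(Y, P, rest, rho, xi)
    return S, G
```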

We refer to this algorithm as the Feature-rich Network Addition Clustering algorithm, FNAC. Consecutive application of FNAC to detect more than one community forms our community detection algorithm SEFNAC, presented below (a code sketch follows the step list).

SEFNAC: Sequential Extraction of Feature-rich Network Addition Clusters

  1. Initialization. Define \(J=I\), the set of entities to which FNAC applies at each iteration, and set the cluster counter \(k=1\).

  2. Define matrices \(Y_J\) and \(P_J\) as the parts of Y and P restricted to J. Apply FNAC to J; denote the output cluster S by \(S_k\), its center c by \(c_k\), the intensity \(\lambda \) by \(\lambda _k\), and the contribution G by \(G_k\).

  3. Redefine J by removing all elements of \(S_k\) from it, and check whether the resulting J is empty. If it is, stop: define the current k as K and output all the solutions \(S_k, c_k, \lambda _k, G_k\), \(k=1,2,..., K\). Otherwise, add 1 to k and go to step 2.
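A sketch of this sequential loop, reusing the fnac and contribution helpers from the previous sketch; trying every remaining entity as a seed is one plausible reading of the "arbitrary singleton" choice:

```python
def sefnac(Y, P, rho=1.0, xi=1.0):
    """Sequentially extract clusters from the residual entity set J until it is empty."""
    J = list(range(Y.shape[0]))
    clusters = []
    while J:
        Y_J, P_J = Y[J], P[np.ix_(J, J)]          # data restricted to J
        # try every remaining entity as a seed and keep the best-contributing cluster
        S_loc, G = max((fnac(Y_J, P_J, s, rho, xi) for s in range(len(J))),
                       key=lambda res: res[1])
        extracted = {J[i] for i in S_loc}
        clusters.append(sorted(extracted))        # map local indices back to I
        J = [j for j in J if j not in extracted]
    return clusters
```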

3 Setting of Experiments for Validation and Comparison of SEFNAC Algorithm

To set a computational experiment, one should specify its constituents:

  1. The set of algorithms under comparison.

  2. The set of datasets on which the algorithms are evaluated and/or compared.

  3. The set of criteria for assessment of the experimental results.

3.1 Algorithms Under Comparison

We take two popular algorithms within the model-based approach, CESNA [21] and SIAN [16], which have been extensively tested in computational experiments. The authors' implementations are publicly available at [11] and [14], respectively. We also tested the algorithm PAICAN from [1] in our experiments. Unfortunately, its results were always less than satisfactory; therefore, we exclude PAICAN from this paper.

3.2 Datasets

We use both real-world and synthetic datasets.

Real World Datasets. We consider five real-world datasets listed in Table 1. Some of them involve both quantitative and categorical features. The algorithms under comparison, unlike the proposed SEFNAC, require features to be categorical. Therefore, whenever a dataset contains a quantitative feature, we convert that feature to a categorical version.

Table 1. Real world datasets under consideration. Symbols N, E, and F stand for the number of nodes, the number of edges, and the number of node features, respectively.

Malaria dataset [9]

The nodes are amino acid sequences, each containing six highly variable regions (HVR). Edges are drawn between sequences whose sixth highly variable regions (HVR6) are similar. In this dataset, there are two nominal node attributes:

  1. Cys labels derived from the highly variable region HVR6 (assumed ground truth);

  2. Cys-PoLV labels derived from the sequences adjacent to regions HVR 5 and 6.

Lawyers dataset [10]

The Lawyers dataset comes from a network study of corporate law partnership carried out in a New England corporate law firm, referred to as SG&R, in 1988–1991. It is available for download at [19]. The network captures friendship ties between the lawyers in the study. The features in this dataset are:

  1. Status (partner, associate),

  2. Gender (man, woman),

  3. Office location (Boston, Hartford, Providence),

  4. Years with the firm,

  5. Age,

  6. Practice (litigation, corporate),

  7. Law school (Harvard or Yale, UCon., Other).

Most features are nominal. Two features, “Years with the firm” and “Age”, are quantitative. The authors of previous studies converted them to the nominal format, which is accepted here too. The categories of “Years with the firm” are \(x\le 10\), \(10< x <20\), and \(x\ge 20\); the categories of “Age” are \(x\le 40\), \(40<x<50\), and \(x\ge 50\).
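As an illustration, such a conversion can be coded with pandas (a hypothetical sketch; note that pd.cut places the boundary value 20 in the middle bin, a marginal deviation from the \(x\ge 20\) category above):

```python
import numpy as np
import pandas as pd

years = pd.Series([5, 12, 25])                 # hypothetical 'Years with the firm' values
years_cat = pd.cut(years, bins=[-np.inf, 10, 20, np.inf],
                   labels=["<=10", "10-20", ">=20"])
```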

World-Trade dataset [17]

The World-Trade dataset contains data on trade between 80 countries in 1994. The link weights represent total imports by row-countries from column-countries, in $1,000, for the class of commodities designated as ‘miscellaneous manufactures of metal’ to represent high technology products. The weights for imports with values less than 1% of the country’s total imports are zeroed. The node attributes are:

  1. Continent (Africa, Asia, Europe, North America, Oceania, South America),

  2. Structural World System Position (Core, Semi-Periphery, Periphery),

  3. Gross Domestic Product per capita in $ (GDP p/c).

We convert the GDP feature into a three-category nominal feature according to the minima of its histogram. The categories are: ‘Poor’ if GDP p/c is less than $4406.9; ‘Mid-Range’ if GDP is between $4406.9 and $21574.5; and ‘Wealthy’ if GDP is greater than $21574.5.

Parliament dataset [1]

The nodes correspond to members of the French Parliament. An edge is drawn if the corresponding MPs have signed a bill together. The features are the constituency of an MP and their political party.

Consulting Organisational Social Network (COSN) dataset [5]

Nodes in this network correspond to employees of a consulting company. The (asymmetric) edges are formed in accordance with the replies to the question: “Please indicate how often you have turned to this person for information or advice on work-related topics in the past three months”. The answers are coded 0 (I Do Not Know This Person), 1 (Never), 2 (Seldom), 3 (Sometimes), 4 (Often), and 5 (Very Often); these six numerals serve as the weights of the corresponding edges. Nodes in this network have the following attributes:

  1. Organisational level (Research Assistant, Junior Consultant, Senior Consultant, Managing Consultant, Partner),

  2. Gender (Male, Female),

  3. Region (Europe, USA),

  4. Location (Boston, London, Paris, Rome, Madrid, Oslo, Copenhagen).

Before applying SEFNAC, all attribute categories are converted into 1/0 dummy variables which are considered quantitative.
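A minimal pandas sketch of this conversion (the attribute table is hypothetical):

```python
import pandas as pd

# hypothetical fragment of a node attribute table
attrs = pd.DataFrame({"gender": ["Male", "Female", "Male"],
                      "region": ["Europe", "USA", "Europe"]})
Y = pd.get_dummies(attrs).to_numpy(dtype=float)   # 1/0 dummies treated as quantitative
```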

Generating Synthetic Data Sets. First of all, we specify the number of nodes N, the number of features V, and the number of communities K in a dataset to be generated. As the number of parameters to control is rather large, we narrow down the variation of our data generator to two settings only: a small-size network and a medium-size network. For the small-size setting, \(N=200\), \(V=5\), and \(K=5\); for the medium size, \(N=1000\), \(V=10\), and \(K=15\).

Generating Networks

Given the numbers of nodes, N, and communities, K, the community cardinalities are defined uniformly at random, subject to the constraints that no community has fewer than a pre-specified number of nodes (set to 30 in our experiments, so that probabilistic approaches remain applicable) and that the community sizes sum to N.

Given the community sizes, we populate the communities with nodes, specified just by their indices. Then we specify two probability values, p and q. Every within-community edge is drawn with probability p, independently of other edges; similarly, every between-community edge is drawn independently with probability q.
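A minimal numpy sketch of this network generator, assuming the community sizes are already fixed (function and parameter names are ours):

```python
import numpy as np

def planted_network(sizes, p, q, rng=None):
    """Draw within-community edges with probability p and between ones with q."""
    if rng is None:
        rng = np.random.default_rng(0)
    N = sum(sizes)
    labels = np.repeat(np.arange(len(sizes)), sizes)     # community index per node
    prob = np.where(labels[:, None] == labels[None, :], p, q)
    A = (rng.random((N, N)) < prob).astype(float)
    A = np.triu(A, 1)                                    # one draw per pair, no self-loops
    return A + A.T, labels
```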

Generating Quantitative Features

To model quantitative features, we generate each cluster from a Gaussian distribution whose covariance matrix is diagonal, with diagonal values uniformly random in the range [0.05, 0.1], to specify the cluster's spread. Each component of a cluster center is generated uniformly at random from the range \(\alpha [-1, +1]\), so that the positive real \(\alpha \) controls the cluster intermix: the smaller the \(\alpha \), the closer the cluster centers are to each other.

In addition to the cluster intermix, we take into account the possible presence of noise in the data. Each noise feature is generated uniformly at random from the interval defined by the minimum and maximum values over the data. In this way, we may supplement the original data with up to \(50\%\) noise features.
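A sketch of this generator of quantitative features, reading the diagonal values in [0.05, 0.1] as variances (names are ours; rng is a numpy random generator):

```python
import numpy as np

def quantitative_features(sizes, V, alpha, rng):
    """Gaussian clusters: centers uniform in alpha*[-1, 1]^V, diagonal variances in [0.05, 0.1]."""
    blocks = []
    for n_k in sizes:
        center = rng.uniform(-alpha, alpha, size=V)
        variance = rng.uniform(0.05, 0.1, size=V)        # diagonal of the covariance matrix
        blocks.append(rng.normal(center, np.sqrt(variance), size=(n_k, V)))
    return np.vstack(blocks)

def add_noise_features(X, n_noise, rng):
    """Append uniformly random features spanning the observed min-max interval."""
    lo, hi = X.min(), X.max()
    return np.hstack([X, rng.uniform(lo, hi, size=(X.shape[0], n_noise))])
```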

Generating Categorical Features

To model categorical features, we randomly choose the number of categories for each of them from the set \(\{2, 3, ..., L\}\), where \(L=10\) for small-size networks and \(L=15\) for medium-size networks. Then, given the number of communities, K, and the community sizes \(N_k\) (\(k=1,..., K\)), the cluster centers are generated randomly so that no two centers coincide at more than 50% of the features.

Once the center of the k-th cluster, \(c_{k}=(c_{k v})\), is specified, the \(N_k\) entities of this cluster are generated as follows. Given a pre-specified intermix threshold \(\epsilon \) between 0 and 1, for every pair (i, v), \(i=1: N_k\), \(v=1:V\), a uniformly random real number r between 0 and 1 is generated. If \(r \le \epsilon \), the entry \(x_{iv}\) is set equal to \(c_{kv}\); otherwise, \(x_{iv}\) is taken randomly from the set of categories specified for feature v.

Consequently, all entities in the k-th cluster coincide with its center, up to rare errors, if \(\epsilon \) is close to 1. The smaller the \(\epsilon \), the more diverse, and thus intermixed, the generated entities.
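A sketch of this categorical generator (names are ours; the constraint that no two centers coincide at more than 50% of the features is omitted for brevity):

```python
import numpy as np

def categorical_features(sizes, V, L, eps, rng):
    """Each entry copies its cluster center's category unless a uniform draw exceeds eps."""
    n_cats = rng.integers(2, L + 1, size=V)          # number of categories per feature
    blocks = []
    for n_k in sizes:
        center = np.array([rng.integers(c) for c in n_cats])
        X = np.tile(center, (n_k, 1))
        randomize = rng.random((n_k, V)) > eps       # r > eps: replace with a random category
        rand = np.column_stack([rng.integers(c, size=n_k) for c in n_cats])
        X[randomize] = rand[randomize]
        blocks.append(X)
    return np.vstack(blocks)
```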

Generating Mixed-Scale Features

We divide the features into two approximately equal parts, one consisting of quantitative features and the other of categorical ones. Each part is filled in independently, as described above.

3.3 Evaluation Criteria

To evaluate the result of a community detection algorithm, we compare the found partition with the generated one using: (1) the customary Adjusted Rand Index (ARI) [6] and (2) the Normalized Mutual Information (NMI) [4].
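Both indices are available off the shelf, e.g., in scikit-learn:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

labels_true  = [0, 0, 1, 1, 2, 2]      # generated partition (toy example)
labels_found = [0, 0, 1, 2, 2, 2]      # partition found by an algorithm
ari = adjusted_rand_score(labels_true, labels_found)
nmi = normalized_mutual_info_score(labels_true, labels_found)
```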

4 Results of Computational Experiments

The goal of our experiments is to test the validity of the SEFNAC algorithm over all types of feature-rich network datasets under consideration. In the cases where the features are categorical, SEFNAC is compared with the popular algorithms SIAN and CESNA.

4.1 Parameters of the Generated Datasets

We set the network parameters, the probability of a within-community edge, p, and that of a between-community edge, q, to take either of two values each: \(p=0.7, 0.9\) and \(q=0.3, 0.6\). In the cases where all features are categorical, we decrease the q-values to \(q=0.2, 0.4\), because all three algorithms fail at \(q=0.6\). Feature generation is controlled by an intermix parameter: \(\alpha \) for quantitative features and \(\epsilon \) for categorical ones. We take each intermix parameter to be either 0.7 or 0.9.

To make the design more realistic, we may explicitly insert 50% uniformly random noise features into some datasets.

Therefore, the generation of synthetic datasets is controlled by one three-valued parameter, the feature scales (quantitative, categorical, mixed), and six two-valued parameters: data size (small, medium); presence of noise features (yes, no); the within-community edge probability p; the between-community edge probability q; and the cluster intermix parameters \(\alpha \) and \(\epsilon \). Altogether, there are 192 combinations. At each setting, we generate 10 datasets, run a community detection algorithm, and calculate the mean and the standard deviation of the ARI (NMI) values over these 10 datasets.

The following two sections present our experimental results for (a) testing the validity of the SEFNAC algorithm on synthetic data, and (b) comparing the performance of SEFNAC and its competitors on both real and synthetic data.

4.2 Validity of SEFNAC

Table 2 presents the results of our experiments on synthetic datasets with mixed-scale features.

Table 2. Performance of SEFNAC on synthetic networks combining quantitative and categorical features, for two different sizes: the average ARI value and its standard deviation over 10 different datasets.

We can see that SEFNAC successfully recovers the numbers of communities at \(q=0.3\) and mostly fails at \(q=0.6\), because the latter corresponds to a counter-intuitive situation in which the probability of a link between separate communities is greater than 0.5. Yet even in this case the partition is recovered exactly when the other parameters keep the structure tight, say at \(p=0.9\). This holds for both the small-size and medium-size cases. Insertion of noise features does reduce the ARI (NMI) levels, but not by much. A real reduction in the number of recovered communities, 7–8 out of the 15 generated, occurs on the medium-size datasets with really loose data structures, \(p=0.7\) and \(q=0.6\), leading to significant drops in the ARI (NMI) values.

The picture is much the same for the quantitative-only and categorical-only feature scales; we do not present those results, to keep the paper short.

4.3 Comparing SEFNAC and Competition

In this section, we compare the performance of SEFNAC with that of CESNA [21] and SIAN [16]. It should be noted that SEFNAC determines the number of clusters automatically, whereas both CESNA and SIAN need it as part of the input.

Table 3 presents our results on synthetic datasets (with categorical features only, as required by the competitors), and Table 4, on real-world datasets.

Table 3. Comparison of CESNA, SIAN and SEFNAC on synthetic datasets with categorical features. The best results are highlighted in bold-face. The average ARI and NMI values and their standard deviations over 10 different datasets are reported.

One can see that, on the small-size datasets, CESNA wins three times (out of 8) regarding ARI, and two more settings regarding NMI. In all the other cases, including the medium-size datasets, SEFNAC wins; SIAN never wins in this table. There is an impressive change in the performance of SIAN at the medium-sized datasets: it comprehensively fails on all counts, producing NaN values, which we interpret as a one-cluster solution.

We also experimented with a slightly different design for categorical feature generation, in which an entity either coincides with its cluster center or is entirely random. Under that design, CESNA wins 7 times on the small-size datasets, and SEFNAC wins on 7 medium-size datasets.

Real-world datasets lead to somewhat different results: CESNA performs rather poorly; SEFNAC wins three times regarding ARI and two times regarding NMI; and SIAN wins two times regarding ARI and three times regarding NMI (see Table 4).

Here, we chose the data normalization method leading, on average, to the larger ARI values. Specifically, we used z-scoring for normalizing the features in the Lawyers, HVR, and COSN datasets. The best results on the World-Trade and Parliament datasets are obtained with no feature normalization. The network data of Lawyers and HVR are normalized by applying the modularity transformation [15]; the network data of COSN, by shifting all the similarities by the average link value [13].
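For reference, here are minimal sketches of these normalizations in one common form; the exact variants used in [15] and [13] may differ in detail:

```python
import numpy as np

def modularity_transform(P):
    """One common form: subtract the random-interaction expectation t_i * t_j / T."""
    t = P.sum(axis=1)                  # row totals
    return P - np.outer(t, t) / P.sum()

def shift_by_average(P):
    """Shift all similarities by the average link value."""
    return P - P.mean()

def z_score(Y):
    """Standardize features to zero mean and unit standard deviation."""
    return (Y - Y.mean(axis=0)) / Y.std(axis=0)
```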

Table 4. Comparison of CESNA, SIAN and SEFNAC on real-world datasets; the average values of ARI and NMI and their standard deviations (std) over 10 random initialisations are presented. The best results are shown in bold-face.

5 Conclusion

This paper proposes a novel combined data-recovery criterion for the problem of detecting communities in a feature-rich network. Our algorithm SEFNAC (Sequential Extraction of Feature-rich Network Addition Clusters) extracts clusters one by one. Our approach is more or less universal regarding the scales of the data at hand. On the other hand, SEFNAC results may depend on data normalization.

We experimentally show that SEFNAC is competitive against two popular state-of-the-art algorithms, CESNA [21] and SIAN [16], on both synthetic and real-world datasets.

Possible directions for future work:

  • A systematic investigation of the relative effect of different data standardization methods on the results of SEFNAC.

  • An extension of SEFNAC to large datasets should be proposed and validated.

  • A trade-off between the two constituent data sources, network and features, as expressed by the factors \(\rho \) and \(\xi \), should be investigated.