Cluster validation techniques
One of the most important issues in cluster analysis is the validation of clustering results. Essentially, cluster validation techniques are designed to find the partitioning that best fits the underlying data and should, therefore, be regarded as a key tool in the interpretation of clustering results. The data-mining literature provides a range of different cluster validation measures, which are divided into two major categories: external and internal [12]. External validation measures have the benefit of providing an independent assessment of clustering quality, since they evaluate the clustering result with respect to a pre-specified structure. However, such prior knowledge about the data is rarely available. Internal validation techniques, on the other hand, avoid the need for such additional knowledge, but have the drawback that they must base their validation on the same information used to derive the clusters themselves. Additionally, some authors consider a third approach to cluster validity, which is based on relative criteria [9]. The basic idea is to evaluate the clustering result by comparing clustering solutions generated by the same algorithm but with different input parameter values. A number of validity indices have been defined and proposed for each of the above approaches. Furthermore, internal measures can be divided according to the specific clustering property they assess in searching for an optimal clustering scheme: compactness, separation, connectedness, and stability of the cluster partitions. Compactness evaluates cluster homogeneity, which is related to the closeness of the objects within a given cluster. Separation reflects the opposite trend by assessing the degree of separation between individual groups. The third type of internal validation measure (connectedness) quantifies to what extent the nearest neighboring data items are placed into the same cluster.
The stability measures evaluate the consistency of a given clustering partition by clustering the data from all but one experimental condition. The remaining condition is subsequently used to assess the predictive power of the resulting clusters by measuring the within-cluster similarity in the removed experiment. A detailed summary of the different types of validation measures can be found in [8, 31].
Since no clustering algorithm performs uniformly best under all scenarios, it is not reliable to use a single validation measure; instead, one should use a few measures that reflect various aspects of a partitioning. In this sense, we have chosen to use two internal validation measures that together cover the first three categories of the above classification for analyzing the internal communication at the organization level: the Silhouette Index (SI) for assessing the compactness and separation properties, and Connectivity for assessing connectedness.
Silhouette Index [25] is a cluster validity index that is used to judge the quality of any clustering solution C = {C1, C2, …, Ck}. Suppose ai represents the average distance from object i to the other objects of the cluster to which the object is assigned, and bi represents the minimum of the average distances from object i to objects of the other clusters. Then, the Silhouette Index of object i can be calculated by
$$s(i) = \frac{b_{i} - a_{i}}{\max \left\{ a_{i}, b_{i} \right\}}.$$
The overall Silhouette Index for clustering solution C of m objects is defined as:
$$s(C) = \frac{1}{m}\sum\limits_{i = 1}^{m} \frac{b_{i} - a_{i}}{\max \left\{ a_{i}, b_{i} \right\}}.$$
The values of the Silhouette Index vary from − 1 to 1, and a higher value indicates a better clustering result.
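The per-object computation of a_i, b_i, and s(i) from a precomputed distance matrix can be sketched as follows (a minimal NumPy illustration; the function and variable names are our own, not from the cited literature):

```python
import numpy as np

def silhouette_scores(dist, labels):
    """Per-object Silhouette Index s(i) from a precomputed distance matrix."""
    labels = np.asarray(labels)
    m = len(labels)
    s = np.zeros(m)
    for i in range(m):
        own = (labels == labels[i])
        own[i] = False  # exclude object i itself
        # a_i: average distance to the other members of i's own cluster
        a_i = dist[i, own].mean() if own.any() else 0.0
        # b_i: minimum over the other clusters of the average distance to them
        b_i = min(dist[i, labels == c].mean()
                  for c in np.unique(labels) if c != labels[i])
        s[i] = (b_i - a_i) / max(a_i, b_i)
    return s
```

The overall index s(C) is then simply the mean of the returned array.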
Connectivity captures the degree to which objects are connected within a cluster by keeping track of whether the neighboring objects are put into the same cluster [8]. Define mij as the jth nearest neighbor of object i, and let \(x_{i,m_{ij}}\) be zero if i and mij are in the same cluster and 1/j otherwise. Then for a particular clustering solution C = {C1, C2,…, Ck} of m objects and a neighborhood size nr, the Connectivity is defined as:
$${\text{Conn}}(C) = \sum\limits_{i = 1}^{m} \sum\limits_{j = 1}^{n_{r}} x_{i,m_{ij}}.$$
The Connectivity takes values between zero and \(mH_{n_{r}}\), where \(H_{n_{r}} = \sum_{j=1}^{n_{r}} 1/j\) is the \(n_{r}\)-th harmonic number, and should be minimized.
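A minimal sketch of the Connectivity computation, assuming a precomputed distance matrix and integer cluster labels (the function name and the argsort-based neighbor search are illustrative choices):

```python
import numpy as np

def connectivity(dist, labels, n_r):
    """Conn(C): penalize nearest neighbors assigned to different clusters."""
    labels = np.asarray(labels)
    conn = 0.0
    for i in range(len(labels)):
        # the n_r nearest neighbors of object i, excluding i itself
        order = np.argsort(dist[i])
        neighbors = [j for j in order if j != i][:n_r]
        for rank, j in enumerate(neighbors, start=1):
            if labels[j] != labels[i]:
                conn += 1.0 / rank  # x_{i,m_ij} = 1/j for split neighbors
    return conn
```

A perfect assignment of all nearest neighbors to their own clusters yields the minimum value of zero.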
External validation measures can be of two types: unary and binary [9]. Unary external evaluation measures take a single clustering result as input and compare it with a known set of class labels to assess the degree of consensus between the two. Comprehensive measures like the F-measure provide a general way to perform this evaluation [30]. In addition to unary measures, the data-mining literature also provides a number of indices that assess the consensus between a produced partitioning and an existing one based on the contingency table of the pairwise assignment of data items. Most of these indices are symmetric and are, therefore, equally well suited for use as binary measures, i.e., for assessing the similarity of two different clustering results. Probably the best known such index is the Rand Index [24], which determines the similarity between two partitions as a function of the positive and negative agreements in the pairwise cluster assignments.
We plan to use the F-measure as an external validation measure to match the clustering solutions generated by k-means (or other clustering algorithms) on the data representing the internal communication at the organization level (i.e., the informal organizational structure) to the existing organizational structure that partitions the employees into working units.
The F-measure is the harmonic mean of the precision and recall values for each cluster [19]. Let us consider two clustering solutions C = {C1, C2, …, Ck} and C′ = {C1′, C2′,…, Cl′} of the same data set. The first solution is a known partition of the considered data set while the second one is a partition generated by the applied clustering algorithm. The F-measure for a cluster Cj′ is then given as:
$$F\left( {C_{j}^{\prime } } \right) = \frac{{2\left| {C_{i} \cap C_{j}^{\prime } } \right|}}{{\left| {C_{i} } \right| + \left| {C_{j}^{\prime } } \right|}},$$
where Ci is the cluster that contains the maximum number of objects from Cj′. The overall F-measure for clustering solution C′ is defined as the mean of the cluster-wise F-measure values, i.e., \(F\left( {C^{\prime } } \right) = \frac{1}{l}\sum\nolimits_{j = 1}^{l} F\left( {C_{j}^{\prime } } \right).\) For a perfect clustering, when l = k, the F-measure attains its maximum value of 1.
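The F-measure computation described above can be sketched as follows (a minimal implementation under our own naming; `true_labels` is the known partition C, `pred_labels` the generated partition C′):

```python
import numpy as np

def f_measure(true_labels, pred_labels):
    """Overall F-measure of a generated partition against a known one."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    scores = []
    for cj in np.unique(pred_labels):
        mask_j = pred_labels == cj
        # C_i: the known cluster sharing the most objects with C'_j
        ci = max(np.unique(true_labels),
                 key=lambda c: np.sum(mask_j & (true_labels == c)))
        overlap = np.sum(mask_j & (true_labels == ci))
        # cluster-wise F: 2|C_i ∩ C'_j| / (|C_i| + |C'_j|)
        scores.append(2.0 * overlap / (mask_j.sum() + np.sum(true_labels == ci)))
    return float(np.mean(scores))
```

Note that the measure is invariant to a relabeling of the generated clusters, since each C′j is matched to its best-overlapping known cluster.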
Analysis of organizational structure
Suppose we have a database D that contains the email interactions among the employees in a company for a given time period. Assume that n employees in total work for the company. Using the email database, we initially create an interaction matrix, where entry eij (i, j = 1, 2,…, n) represents the number of emails sent by employee i to employee j. The interaction matrix is normalized by dividing each eij (i, j = 1, 2,…, n) by the total number of emails, denoted by ei, sent by employee i, i.e., each value enij in the normalized interaction matrix is equal to eij/ei (i, j = 1, 2,…, n). Next, using the normalized interaction matrix, we can construct an undirected graph, where vertices represent employees and edges represent the communication between two employees. The graph can be represented by an adjacency matrix, where each entry of the matrix is a value dij \(\in\) [0, 1] (i, j = 1, 2,…, n) expressing the distance (dissimilarity) between employees i and j. The distance dij (i ≠ j) can be defined by the following equation
$$d_{ij} = 1 - 0.5\left( {en_{ij} + en_{ji} } \right).$$
(1)
In addition, dii = 0 by default, for all i = 1, 2,…, n. It can easily be shown that dij \(\in\)[0, 1] for all i, j = 1, 2,…, n.
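The construction of the distance matrix from the raw interaction matrix can be sketched as follows (a minimal NumPy implementation of Eq. (1); the function name is our own, and rows with no sent emails are mapped to zero weights as an illustrative convention):

```python
import numpy as np

def distance_matrix(emails):
    """Distance matrix per Eq. (1).

    emails[i, j] = number of emails sent by employee i to employee j.
    """
    emails = np.asarray(emails, dtype=float)
    sent = emails.sum(axis=1, keepdims=True)      # e_i: total sent by employee i
    en = emails / np.where(sent > 0, sent, 1.0)   # row-normalized en_ij = e_ij / e_i
    d = 1.0 - 0.5 * (en + en.T)                   # Eq. (1)
    np.fill_diagonal(d, 0.0)                      # d_ii = 0 by convention
    return d
```

The resulting matrix is symmetric with all entries in [0, 1], as required for the subsequent validation analysis.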
Notice that in [3], the distance matrix was normalized by the maximum number of emails exchanged between two employees in the company. However, this can be misleading if there are two employees who exchange a lot of emails in comparison with the others, since this reduces the distances for all employees. Therefore, in the current study we normalize each row i (i = 1, 2,…, n) of the interaction matrix by the total number of emails sent by employee i (ei). Evidently, the normalized value enij then represents the relative intensity (weight) of the communication of employee i with employee j with respect to the other employees in the company, i.e., eni1 + eni2 + ··· + enin = 1 (i = 1, 2,…, n). In this way, in the normalized interaction matrix, employees with very few email communications have weights very close to those of employees with many email communications. The latter is a necessary condition for our further analysis of the email communication data of the company, since we need to compare any two employees with respect to their relative level of interaction with the others.
Next, suppose the employees are partitioned into k main working units (divisions, departments), i.e., the existing formal organizational structure determines a cluster partition. This cluster partition is considered as a known partitioning of the employees in the company and is denoted by C = {C1, C2,…, Ck}, where Ci (i = 1, 2,…, k) represents an organizational division. The cluster validation measures introduced in the foregoing section can be used to evaluate whether this known partitioning of the employees into k divisions reflects a clustering (informal) structure present in the email interaction database D. For example, we can study whether the divisions are of a high quality, i.e., whether the “within” communications are high in comparison with the “between” communications. In addition, it would be useful to have an evaluation of the overall organizational structure. We can further investigate which employees appear to be well assigned (classified), which ones are wrongly assigned (misclassified), and which ones lie in between divisions. We can also try to obtain an idea about the number of “natural” divisions that are present in the email interaction database.
The Silhouette Indices discussed in Sect. 4.1 can easily be constructed in the considered context and can be used to assess whether we have compact and clearly separated organizational units. In addition, the Connectivity scores can be used to evaluate whether employees are well connected within a division. In order to calculate SI and Connectivity scores, we only need the partitioning of the employees into the divisions and the collection of all distances between employees, i.e., the distance matrix. Thus, we can compare the performance of the different divisions with respect to the effectiveness of their email communication using SI and Connectivity validation measures.
The worst performing divisions (the ones with the lowest SI and highest Connectivity scores) can further be analyzed by computing the individual Silhouette Indices of their employees. The computed Silhouette Indices for such a division, e.g., division Cp, can be depicted in a plot. In addition, for each employee i \(\in\) Cp we can find the second-best (or best, in the case of a negative SI) choice (e.g., division Cr), i.e., if he/she could not be allocated to this division, which division would be the closest competitor. We can use the plot built for division Cp and look initially for employees whose SI scores are negative. Let us consider an extreme case when there is an employee i \(\in\) Cp whose Silhouette Index value [denoted by s(i)] is close to −1. This means that employee i is much closer to division Cr than to Cp, i.e., this person communicates much more with employees from division Cr than with people from his/her own division Cp. Therefore, it would seem much more natural to assign employee i to division Cr. In general, all such employees can be found in the worst performing divisions, and it could be recommended to allocate these employees to their closest divisions.
We have a different situation when s(i) of an employee i \(\in\) Cp is about zero. Then, it is not clear whether employee i should have been assigned to Cp or Cr. If such situations predominate in the analyzed divisions (i.e., most employees have Silhouette Indices that are close to zero), this could be an indication that the current organizational structure generally needs to be improved. This is especially valid when the overall Silhouette Index for the current organizational structure is close to zero or negative.
We can also examine the effect of migrating groups of employees on organizational performance. For instance, we can move groups of employees from the worst performing divisions to other available divisions and evaluate how this affects the communication behavior of the divisions. In this way, we are able to perform the above discussed analysis at a different level of organizational granularity; in this case, the organization is considered as built of the different working groups (teams) instead of the individual employees. In [3], we did not study and validate this reorganization scenario.
A validation approach, which is based on the relative criteria and does not involve statistical tests, is discussed in [9]. Its general idea is to choose the best clustering scheme of a set of defined schemes according to a pre-specified criterion. We can apply this idea to our context in order to get some clue about the optimal number of organizational units, i.e., the number of units that is supported by the underlying email communication database. This can be implemented by trying to generate clustering results for a range of different numbers of clusters (divisions) and subsequently assess the quality of the obtained clustering solutions. For example, SI and Connectivity measures can be used as validity indices to identify the best clustering scheme. A number of other principled ways, such as AIC [1] and BIC [28], can also be used to determine the number of clusters apart from the above method.
We can then run a clustering algorithm (e.g., k-means) in order to partition the employees into the identified optimal number of divisions. The F-measure introduced in the foregoing section can further be applied to match the generated cluster partition with the current organizational structure and to assess to what degree this partition differs from the existing one. This might indicate whether to recommend a future redesign of the company structure, e.g., if the F-measure score is below a given threshold.
Let us suppose that we have two databases D1 and D2 that contain the email communications among employees before and after a company reorganization, respectively. In this context, it would be useful if we could compare the two organizational structures to determine which is better. We would like to know whether the implemented structural changes have led to improved information flow and email communication. We can evaluate and compare the two cluster partitions (informal organizational structures) determined by the two email communication databases D1 and D2 with respect to the SI and Connectivity measures. In addition, the Rand Index mentioned in the foregoing section can be used for assessing the similarity of the formal and informal organizational structures before and after the reorganization, i.e., two similarity scores can be generated. For example, if the former score (the Rand Index calculated before the reorganization) outperforms the latter one, this can be a sign that the implemented structural changes have not improved the communication behavior in the organization.
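The Rand Index itself amounts to counting pairwise agreements between two partitions, which can be sketched directly (a minimal implementation; the function name is illustrative):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two partitions agree.

    A pair agrees when both partitions place it together, or both
    place it apart (positive and negative agreements, respectively).
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)
```

Because only pairwise co-membership is compared, the index is insensitive to how the clusters in each partition are labeled.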
We can calculate the individual SI for the employees within the organization and rank the employees in decreasing order w.r.t. the calculated SI scores. In addition, it is possible to rank the employees with respect to their hierarchical position in the organization. The hierarchical position (HP) is a measure that denotes the importance of an employee within the organization [17]. The HP of each employee i can be calculated as follows:
$${\text{HP}}(i) = \frac{{\mathop \sum \nolimits_{j \ne i} D(i,j)}}{n - 1},$$
(2)
where D(i, j) is the hierarchical difference between employees i and j, and n is the number of employees within the organization. The hierarchical difference D(i, j) is computed as follows:
$$D(i, j) = \left\{ \begin{array}{ll} 1, & \text{if } i \text{ is higher in the hierarchy than } j \\ 0, & \text{if } i \text{ and } j \text{ are at the same level of the hierarchy} \\ -1, & \text{if } i \text{ is lower in the hierarchy than } j \\ \end{array} \right.$$
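The HP computation of Eq. (2) can be sketched as follows, assuming the formal hierarchy is encoded as a numeric level per employee (a higher number meaning a higher rank; this encoding and the function name are our own assumptions):

```python
import numpy as np

def hierarchical_position(levels):
    """HP(i) per Eq. (2) from numeric hierarchy levels (higher = higher rank)."""
    levels = np.asarray(levels)
    n = len(levels)
    hp = np.empty(n, dtype=float)
    for i in range(n):
        # sign(levels[i] - levels[j]) realizes the piecewise D(i, j) above
        diffs = np.sign(levels[i] - np.delete(levels, i))
        hp[i] = diffs.sum() / (n - 1)
    return hp
```

An employee above everyone else obtains HP = 1, one below everyone HP = −1.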
Kendall’s ranking comparison method can further be used to compare two rankings [14]. It compares the nodes (employees) in pairs, i.e., the positions of each pair of employees within both rankings. If the position of employee i is related to the position of employee j in both rankings monotonically in the same direction, then this pair is well correlated. It is assumed that when the level in the hierarchy is the same within a pair, it does not matter whether the two employees occupy different positions in the ranking generated based on the SI scores. Kendall’s rank correlation coefficient is a value in the range [− 1, 1], where 1 means that the two rankings are perfectly correlated and − 1 means that they are completely different (in the opposite order). By comparing the two rankings, we can analyze whether employees who are high in the official hierarchy (ordered by HP) are also important (well integrated) in the social network of the organization (ordered by SI). This analysis can be performed not only globally, but also locally at the department level. The evaluation scenario discussed in this paragraph was not studied and validated in our initial work published in [3].
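The comparison of the SI-based and HP-based rankings can be sketched with SciPy's Kendall correlation (the scores below are hypothetical illustrative values, not data from the study; `kendalltau` computes the tau-b variant, which discounts tied pairs, consistent with the treatment of equal hierarchy levels described above):

```python
from scipy.stats import kendalltau

# Hypothetical scores for five employees (illustrative values only)
si_scores = [0.8, 0.6, 0.4, 0.1, -0.2]  # individual Silhouette Indices
hp_scores = [1.0, 0.5, 0.5, 0.0, -0.5]  # hierarchical positions, Eq. (2)

# tau close to 1: the formal hierarchy and the informal network agree
tau, p_value = kendalltau(si_scores, hp_scores)
```

The same call can be issued per department to perform the analysis locally.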