
1 Introduction

Consider the problem of evaluating and improving the performance of teams in organizations based on employee responses to questionnaires. The teams differ in size, and the roles of employees may differ per organization. A key question for an organization's top management is how to support the autonomy of these teams while still keeping an eye on the overall process and the coherence of the teams' performance. Assuming a span of control of 10–15 direct reports for an average manager, a medium-sized organization may easily comprise hundreds of teams. Pattern recognition in organizational development may therefore supply fundamentally important information about how similar – or dissimilar – teams are [1, 15, 20]. A possible solution is to focus on the diversity within a team: is there a large group of people who are all doing a similar job, or are there some isolated groups of people who are doing something very different from the rest? Identifying such groups – clusters of employees – would help to compare organizational structures at a higher level.

More formally, in this paper we focus on comparing sets (teams) of different samples (employees) residing in different feature spaces (evaluation questions). Comparing team structures is then equivalent to comparing similarity matrices, with each similarity matrix originating from a single team. Comparing similarities alleviates the problem of different feature spaces, yet is still not trivial because the sets can be of different sizes and there are no natural correspondences between the samples.

Comparing distance matrices has links with comparing graph structures: a distance matrix between N samples can be seen as a fully connected graph with N nodes, where the nodes are unlabeled and the edges are associated with weights. In graph-based pattern recognition, approaches such as graph edit distance [3, 21] or graph kernels [10, 12] have been used to define distance or similarity measures between graphs. Graph matching approaches search for the best correspondence between the nodes and define the graph distance as a measure of discrepancy between the matched nodes and edges. Graph kernels define similarity by considering all possible correspondences. However, the search space for correspondences becomes very large if the nodes are unlabeled and the graph is fully connected. In [16] we used a threshold on the distances to reduce the number of edges, but this threshold had a large influence on the results, suggesting that the larger distances, too, can be informative.

To avoid removing informative edges and to obtain a computationally efficient solution, in this paper we focus on feature representations of distance matrices. By representing each distance matrix in the same feature space, the matrices can be compared with each other, for example using the Euclidean distance. We investigate several representations, based on spectra [6], histograms of all distances [18], histograms of nearest-neighbor distances, and hubness properties [23]. A detailed description of the representations is given in Sect. 2.

In Sect. 3 we investigate how well these feature representations can encode the class information for some artificial examples. In Sect. 4 we investigate how well these representations perform on multiple instance learning (MIL) datasets, where the goal is to classify sets of feature vectors. In Sect. 4.2 we apply the representations to real-world organisational data and discuss some of the insights that arise from comparing teams of people.

2 Dissimilarity Matrix Representation

We assume we have a collection of N square dissimilarity matrices \(\{D_n\in \mathbb {R}^{m_n\times m_n}; n=1,\ldots ,N\}\). One element of matrix \(D_n\) is indicated by \(D_n(i,j)\). We assume that the matrices have the following characteristics:

  • The dissimilarities of objects to themselves are zero (i.e. the \(D_n\) have zeros on the diagonal), and the dissimilarities are symmetric (\(D_n(i,j)=D_n(j,i)\)). In situations where the matrices are not symmetric, they are made symmetric by averaging \(D_n\) and its transpose: \(\tilde{D}_n = (D_n+D_n^\top )/2\) (see the sketch after this list).

  • The size \(m_n\times m_n\) of the matrices can be different for each \(D_n\). It is assumed that the matrices have a minimum size of \(3\times 3\).

  • The order of the rows and columns is arbitrary and may be permuted without altering the information that is stored in the dissimilarity matrix.
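As a minimal illustration of these assumptions, the following sketch enforces a zero diagonal and symmetry on a given matrix (the function name symmetrize and the use of NumPy are our assumptions, not part of the original method description):

```python
import numpy as np

def symmetrize(D):
    """Enforce the assumed properties: zero diagonal and symmetry."""
    D = np.array(D, dtype=float)   # copy, so the input is not modified
    np.fill_diagonal(D, 0.0)       # self-dissimilarities are zero
    return (D + D.T) / 2           # average with the transpose
```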

For this data type we investigate a few simple vector representations. Other ways of comparing the dissimilarity matrices are also possible (such as embedding each matrix into a low-dimensional space and using the earth mover's distance, or matching the rows and columns of the dissimilarity matrices and computing the Frobenius norm [14]), but these tend to be computationally expensive. Here we focus on vector representations of the dissimilarity matrices.

We consider the following representations (a code sketch of several of them follows the list):

  1. spectrum features spect: use the k largest eigenvalues \(\sigma \) of the centered matrix \(D_n\):

    $$\begin{aligned} \mathbf {x}_n = \sigma _{1:k}(C^\top D_n C),\quad \text {where}\,\, C = \mathbb {I}_{m_n}-\frac{1}{m_n}\mathbf {1}\mathbf {1}^\top \end{aligned}$$
    (1)
  2. histogram of distances hist: collect all dissimilarities from all matrices, split the range of dissimilarity values from 0 to the maximum into k equally sized bins, and count for each \(D_n\) the number of occurrences in each bin. Optionally, the counts can be converted into frequencies by dividing by \(m_n(m_n+1)/2\).

  3. equalized histogram of distances histeq: split the range of dissimilarity values into k bins with an equal number of counts (instead of using equally wide bins). The bins become wider where dissimilarity values occur rarely, and narrower where values appear frequently.

  4. histogram of the k-nearest neighbor distances distnn: instead of collecting all dissimilarities, only the dissimilarities up to the k-th nearest neighbor are used. Per row of \(D_n\), only \(k< m_n\) dissimilarities are used; the total number of dissimilarities is therefore reduced from \(m_n(m_n+1)/2\) to \(m_n k\). This captures variations in local densities better.

  5. histogram of how often samples are the k-th nearest neighbor of other samples disthub: a measure used in hub analysis [24]. First the dataset is represented by a k-occurrence histogram, which stores how often each sample is the k-th nearest neighbor of others. To make this representation comparable across datasets of different sizes, it is summarized by q quantiles of the histogram. For the final representation, we concatenate the quantile histograms for \(|K|\) different values of \(k \in \{1,3,\ldots \}\), resulting in a \(|K|\times q\)-dimensional feature vector.
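As a concrete illustration, the sketch below implements the spect, hist and distnn representations in Python (the function names and the pooled bin edges passed in as edges are our assumptions, not the original implementation):

```python
import numpy as np

def spect(D, k=3):
    """Spectrum features: the k largest eigenvalues of the centered matrix, Eq. (1)."""
    m = D.shape[0]
    C = np.eye(m) - np.ones((m, m)) / m
    eigvals = np.linalg.eigvalsh(C.T @ D @ C)  # C^T D_n C is symmetric
    return np.sort(eigvals)[::-1][:k]

def hist_feats(D, edges):
    """Histogram of all (upper-triangular) dissimilarities over fixed bin edges."""
    d = D[np.triu_indices_from(D, k=1)]
    counts, _ = np.histogram(d, bins=edges)
    return counts / len(d)  # optionally convert counts to frequencies

def distnn_feats(D, k, edges):
    """Concatenated histograms of the 1st up to the k-th nearest-neighbor distances."""
    m = D.shape[0]
    Ds = np.sort(D + np.diag(np.full(m, np.inf)), axis=1)  # push self-distances last
    return np.concatenate(
        [np.histogram(Ds[:, j], bins=edges)[0] / m for j in range(k)])
```

For hist, the bin edges would be computed once from the pooled dissimilarities of all matrices; histeq would instead place the edges at quantiles of the pooled values.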

In some situations we might want to be invariant to a (non-linear) scaling of the dissimilarity values. For example, an expert may only have provided a relative ranking, rather than an exact dissimilarity between two elements of a set. In this case, the extracted features should be invariant to the scaling of the dissimilarities. Of the representations above, only disthub is invariant.

3 Illustrative Examples

To show the characteristics of the different representations, we construct several multi-class artificial datasets (a generation sketch for one of them follows the list). Depending on the experiment, the number of dissimilarity matrices and their sizes are varied.

  • The cluster dataset is constructed to investigate how well clustering structure can be characterized. In cluster, the dissimilarity matrices are computed from 2-dimensional datasets containing samples belonging to a varying number of clusters (up to four). The class label of a dissimilarity matrix equals the number of clusters, so this defines a 4-class classification problem.

  • The subspace dataset is constructed to investigate how well subspace structure can be characterized. In subspace, the dissimilarity matrices are derived from p-dimensional Gaussian distributions, where the dimensionality is \(p=1\) for class 1, \(p=2\) for class 2, up to \(p=4\) for class 4.

  • The outlier dataset is used to investigate the sensitivity to outliers. In outlier, the matrices are derived from 2-dimensional Gaussian distributions (zero mean, identity covariance matrix). Class 1 does not contain outliers, class 2 contains one outlier drawn from a Gaussian distribution with a 10 times larger covariance matrix, and class 3 contains two such outliers.
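A minimal sketch of how one dissimilarity matrix of the cluster dataset could be generated (the name make_cluster_bag, the cluster spread and the placement of the centers are our assumptions; the exact generation procedure may differ):

```python
import numpy as np

def make_cluster_bag(n_clusters, m=30, spread=5.0, rng=None):
    """Sample a 2-D dataset with the given number of clusters and return
    its Euclidean dissimilarity matrix; the label is the number of clusters."""
    rng = rng if rng is not None else np.random.default_rng()
    centers = rng.uniform(-spread, spread, size=(n_clusters, 2))
    member_of = rng.integers(n_clusters, size=m)        # cluster of each sample
    X = centers[member_of] + rng.standard_normal((m, 2))
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))            # m x m distance matrix
```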

Fig. 1. The different feature representations for dissimilarity matrices derived from the cluster dataset with one, two, three or four clusters.

Figure 1 shows the five different representations for a sample of 100 dissimilarity matrices drawn from the cluster dataset. For the spectrum representation three features are computed, for the (equalized) histograms \(k=10\), and for the disthub representation in total 75 features are computed. This cluster dataset has a very clear structure, and all representations are able to distinguish well between the four classes. In particular, the distinction between datasets with one cluster and those with more than one cluster is easy to make. For the disthub representation the distinction between the classes is less visible in the figure, due to the large difference in scales between the different features.

For each dataset, we compute the dissimilarity matrices using the Euclidean distances between the samples. We then compute the different representations and use a linear classifier (LDA) to distinguish the different classes per dataset.
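A sketch of this protocol for the cluster data, reusing hist_feats and make_cluster_bag from the earlier sketches (the scikit-learn classes are our choice of tooling, not necessarily the original one):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# 25 dissimilarity matrices per class, labeled by their number of clusters
matrices = [make_cluster_bag(c) for c in (1, 2, 3, 4) for _ in range(25)]
y = np.repeat([1, 2, 3, 4], 25)

edges = np.linspace(0, max(D.max() for D in matrices), 11)  # k = 10 bins
X = np.vstack([hist_feats(D, edges) for D in matrices])
print(cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=10).mean())
```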

Fig. 2. Learning curves showing the error as a function of the number of training dissimilarity matrices. From left to right: results on the cluster, subspace and outlier data.

Figure 2 shows the classification performance on the three artificial sets as a function of the number of training matrices. The size of the individual dissimilarity matrices is fixed to \(m_n=30\). In many situations the (equalized) histogram is able to capture the information needed for good generalization. The histogram estimates start to suffer from noise for very small datasets and for situations where there is no clustering structure and only the subspace dimensionality is informative. In these situations a spectrum representation is to be preferred. When very large training sets are available, it is advantageous to use the nearest-neighbor distance histograms. Because distnn combines the histograms of the first, second, and all higher nearest neighbors, this representation becomes very high-dimensional, but also very rich.

Fig. 3. The classification performance as a function of the size \(m_n\) of the dissimilarity matrices, for the cluster data (left), the subspace data (middle) and the outlier data (right).

Figure 3 shows curves similar to those in Fig. 2, but here the sizes of the individual dissimilarity matrices are varied while the number of training matrices is fixed to \(N=100\) per class. Here as well, the (equalized) histograms perform well when the dissimilarity matrices are large: there is then a sufficient number of values available to estimate the histograms well. For very small matrices, and for characterizing subspace structure or outliers, the spectrum performs well. Somewhat surprisingly, to characterize clustering structure with small dissimilarity matrices, the nearest-neighbor distances are most effective, although this representation tends to overfit with larger matrices.

4 Experiments

We distinguish between two sets of experiments: a supervised and an unsupervised set. For the supervised set, we have a collection of labeled dissimilarity matrices. Here we use the bags from MIL data, where the distances between the instances in one bag give one dissimilarity matrix, and each matrix is labeled according to the original bag label (positive or negative). For the unsupervised set we only have a collection of dissimilarities within teams of people, for which we want to investigate how much variability is present in the teams, and what constitutes this variability.

4.1 Supervised Experiments: Multiple Instance Learning

We look at a wide variety of multiple instance learning (MIL) problems. In MIL, the n-th sample is a bag \(B_n = \{\mathbf {x}_{n1}, \mathbf {x}_{n2}, \ldots , \mathbf {x}_{nm_n}\}\) of \(m_n\) instances. The goal is to classify bags, based on the presence of concept feature vectors or on the overall distribution of the bag's instances. Consider image classification, where a bag is an image and an instance is an image patch. When classifying images of tigers, a patch containing a tiger is an example of a concept instance. When classifying images of scenes, it might be more reasonable to examine several patches before deciding what type of environment the image depicts.

Characteristics of the datasets are listed in Table 1. From our previous experience with these datasets [5, 25], we expect them to contain a mix of concept-like and distribution-like problems. Note that in our previous work [5, 25] we represented each bag by its dissimilarities to a set of prototype bags, whereas here we use an absolute representation where each bag is represented by the dissimilarities between its own instances.

Table 1. Characteristics of MIL datasets. Most of the datasets are available for download from http://www.miproblems.org

We removed bags that contained only 1 or 2 instances. We then represented each bag by an \(m_n \times m_n\) dissimilarity matrix between its instances, where the dissimilarity is simply the Euclidean distance between the feature vectors, and computed for each matrix the representations described in Sect. 2. We used two classifiers: a linear discriminant classifier and a 1-nearest neighbor classifier. The experiments were performed using 10-fold cross-validation, where the best hyperparameter for each representation type (the optimal value for k) was determined on the training set using a second, internal 10-fold cross-validation. We chose \(k \in \{5, 10, 25, 50, 100\}\).
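A minimal sketch of this selection step (bag_to_matrix and pick_k are names we introduce; hist_feats is reused from the sketch in Sect. 2):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def bag_to_matrix(B):
    """Euclidean dissimilarity matrix between the instances of one bag."""
    return squareform(pdist(B))

def pick_k(matrices, y, ks=(5, 10, 25, 50, 100)):
    """Inner 10-fold CV on the training bags to select the number of bins k."""
    d_max = max(D.max() for D in matrices)
    scores = {}
    for k in ks:
        edges = np.linspace(0, d_max, k + 1)
        X = np.vstack([hist_feats(D, edges) for D in matrices])
        scores[k] = cross_val_score(
            LinearDiscriminantAnalysis(), X, y, cv=10).mean()
    return max(scores, key=scores.get)
```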

We report the AUC performances of both classifiers, using the best parameters for each representation type. For reference, we also list the best performance of traditional MIL classifiers. The classifiers that often perform well are MILES [4], MI-SVM [2], EM-DD [27], a logistic classifier trained on a bag summary representation (based on the mean instance, or the min/max values per feature) [11], and the p-posterior classifier [26] (Table 2).

Table 2. Mean AUC (standard deviation), \(\times 100\), of two classifiers on five representations of MIL bags. Bold = best or not significantly worse than the best representation per classifier.

The results are similar to those on the artificial data: when the dissimilarity matrices are small, a spectrum representation is preferred. When larger training sets are available, it is often good to choose an equalized histogram. These histograms tend to become relatively high-dimensional, so the classifier cannot be too complex; a linear classifier is therefore a good choice.

It is also surprising that, although these representations discard the absolute locations of the instances in feature space, it is still possible to achieve very reasonable classification performance. For some datasets the classification performance exceeds the best performance achieved so far (comp.graphics, Biocreative) or is comparable to it (Corel African, alt.atheism, Harddrive). For datasets that contain a specific concept (Musk1, Musk2, AjaxOrange, Web recomm. 1, Brown Creeper), a classifier that has access to the individual feature vectors is better off.

4.2 Unsupervised Experiments: Analysis of Teams of People

Given the required speed in a strategic decision-making process, we used an online survey for the unsupervised gathering of a strategic status update from 20,191 employees in 1,378 teams in 277 different client projects on, for example, Human Resource Management, Information Technology, and Marketing and Sales. We did not use a Likert scale, given the subsequent need for statistical corrections for the structure of the survey [9], for various response styles [7], for a variety of sampling errors [19], and for a wide variety of biases [22]. Instead, we opted for a Guttman scale with objectively verifiable answers [8, 13]. The assessment questions differed between teams. Four types of assessment can be distinguished: (1) human resources (HR): team effectiveness, competency assessments, cultural aspects; (2) strategy: how the strategy is ultimately incorporated, innovation assessment; (3) marketing and sales: analysis of client processes, commercial guidance of shops; and (4) IT: project assessment, IT processes, IT governance. From the answers a dissimilarity matrix is derived by computing the pairwise Euclidean distances between the answer scores of all members of a team.

Fig. 4. t-SNE visualisations of 1378 teams of people. Left: spectrum representation of all the teams; the size of the circles indicates the size of the corresponding dissimilarity matrix. Right: equalized histogram representation for teams that received a human resource assessment.

Figure 4 shows the resulting embeddings using the spectrum and equalized histogram representations. Both representations are 5-dimensional, and the 2-dimensional embedding of the 5D data is obtained using t-SNE [17] (a sketch of this embedding step is given below). In the left subplot the marker size indicates the size \(m_n\) of the corresponding matrix \(D_n\). It appears that the first important component is the team size. The plot also shows that there is more variation among the smaller teams, suggesting that in smaller teams there is more room for specialisation. Larger teams tend to become more similar to each other.
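A minimal sketch of this embedding step (the perplexity value is an assumption, not necessarily the setting used for the figure):

```python
from sklearn.manifold import TSNE

# X: one 5-dimensional feature vector (spectrum or equalized histogram) per team
emb = TSNE(n_components=2, perplexity=30.0).fit_transform(X)
# emb[:, 0] and emb[:, 1] are the 2-D coordinates shown in Fig. 4
```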

When we normalise for team size and focus on one type of questionnaire (Human Resources), we obtain the scatterplot on the right. There is one prominent outlier team; this turns out to be a team that received a questionnaire with 160 questions, while normally fewer than 20 questions are used. Furthermore, there is a large cluster on the left, which contains teams with fairly homogeneous members, and a long tail up to the right where teams exhibit stronger and stronger clusters of subteams. The teams farthest into the tail show a clear clustering, while teams closer to the homogeneous cluster contain only a few outlying members.

5 Conclusions

We compared several feature vector representations for characterizing (square) dissimilarity matrices that can vary in size and whose rows and columns can be arbitrarily permuted. The spectrum representation is very effective, in particular when the sample sizes are small: it can not only characterize the intrinsic dimensionality, it is also able to characterize cluster structure. When a large sample size is available, it is often advantageous to use the more descriptive histograms of distances. These results are observed both on artificial data and on some real-world MIL problems. For MIL, the representations combined with a linear or nearest-neighbor classifier are sometimes competitive with state-of-the-art classifiers.

We then used the representations in an unsupervised manner to characterize real-world organizations. Our analysis revealed some clusters of organizations that could be interpreted by an expert. Given the current dissimilarity scores, we suggest further research into the extent to which organizations are similar with respect to issues that affect a multitude of teams (a top management issue), a single team (a middle management issue) or a single employee (a lower management issue), and whether that similarity is particularly present in specific management topics (for example, Human Resource Management) and/or in specific industries (e.g. Professional Services).