for Improved Graph Clustering of Source Code

To perform cluster analysis on graphs we utilize graph kernels, the Weisfeiler-Lehman kernel in particular, to transform graphs into a vector representation. Despite good results, these kernels have been criticized in the literature for high dimensionality and high sensitivity, so we propose an efficient subtree distance measure that is subsequently used to enrich the vector representations and enables more sensitive distance measurements. We demonstrate the usefulness in an application where the graphs represent different source code snapshots, and a cluster analysis of these snapshots provides the lecturer with an overview of the overall performance of a group of students.


Motivation
Graphs are a universal data structure and have become very popular over recent years in various domains with structured data (e.g. protein function prediction, drug toxicity prediction, malware detection, etc.). To apply existing clustering or classification techniques to graphs, either a distance (or similarity) measure is needed, or a transformation into a vector representation, for which most clustering and classification algorithms were developed. In this paper we are concerned with repeatedly clustering graphs to understand the evolution of students' source code. As will be explained in Sect. 2, we settle on Weisfeiler-Lehman (WL) graph kernels [9] to decompose the graph into subtrees and to define a similarity function over the number of common substructures across graphs. It has been criticized, however, that WL subtree kernels produce (a) many different substructures, so that only a few substructures are common across graphs, which establishes (b) a tendency of each graph to be similar only to itself. In this paper we propose to include the subtree similarity in an efficient postprocessing step to tackle both problems: we exploit the fact that many of the substructures may be formally distinct but actually quite similar. By enriching the vector representations we obtain positive effects for the overall graph similarity.

Measuring Similarity Directly
A common approach to compare graphs is to calculate the edit distance between graphs $F$ and $G$: the minimal number of steps to transform $G$ into $F$. For the special case of trees, these steps consist of node deletion, node insertion, and node relabelling. A survey on tree edit distance can be found in [1]; an efficient $O(n^3)$ algorithm, $n$ being the maximal number of nodes in $F$ and $G$, is proposed in [2]. To adapt a tree edit distance to a specific application, there are approaches to learn appropriate cost parameters [6]. With general graphs, the editing process becomes more complicated as additional operations need to be considered (edge insertion and edge deletion). A survey on graph edit distance is given in [3]. Its computation is exponential in the number of nodes and therefore infeasible for large graphs.

Measuring Similarity Indirectly
Instead of coping with the full graph, one may decompose the graph into a set of smaller entities and compare these sets instead of the graphs. These entities may be frequent subgraphs (e.g. [8]), walks (short paths), graphlets (e.g. [10]) or subtrees (e.g. [9]). Many graph kernel approaches explicitly construct a vector representation, where the $i$-th element indicates how often the $i$-th substructure occurs in the graph. From this vector a kernel or similarity matrix may be calculated. Recent approaches, such as subgraph2vec [5], use deep learning to translate graphs into such a vector representation.
This section particularly reviews the construction of a WL subtree kernel (following [9]), as it will be the foundation of the next section. The subtree kernel transforms a graph into a vector, where a non-zero entry indicates the occurrence of a specific subtree in the graph. The total number of dimensions is determined by all subtrees that have been identified in the full set of graphs.
Given a graph $G = (V, E)$, a label function $l : V \to \Sigma^*$ yields for each node $v \in V$ a label over a finite alphabet $\Sigma$. The initial labels $l_0(v)$ are provided together with the graph $G$ (original labels). A new label function $l_i$ is obtained by calling $\mathrm{WLSK}(G, l_{i-1})$, which is shown in Algorithm 1: it constructs new labels by concatenating all child labels deterministically (by processing children in some lexicographic order). A series of $n$ WLSK calls provides a sequence of label functions $l_0, \ldots, l_n$, where a node label $l_i(v)$ takes all children of $v$ up to depth $i$ into account. A label $l_i(v)$ may thus serve as a kind of fingerprint of the neighbourhood of $v$ (hashcode).
The final vector representation of a graph is obtained from

$$\Phi(G) = \left(\#l_0^1, \ldots, \#l_0^{k_0},\; \ldots,\; \#l_n^1, \ldots, \#l_n^{k_n}\right)$$

where $\#l_i^j$ denotes how many nodes received the label $l_i^j$, the $j$-th of the $k_i$ distinct labels at depth $i$. Originally this approach was proposed as a test of isomorphism [11], as isomorphic graphs exhibit identical substructures (labels).

Figure 1 shows an illustrative example. On the top left we have two graphs $G_1$ and $G_2$ with nodes $v_1$-$v_7$ and $v_8$-$v_{14}$, resp. The (numeric) label is written in the node, the node identifiers are shown in gray. The table next to the graphs shows, for each node, how the new label $s$ is constructed from the current node label and its successors. For instance, node $v_1$ of $G_1$ has label 0 and successors with labels 2, 0, 1. Algorithm 1 creates new labels by appending the successor labels (in sorted order) to the node label, which yields "0 : 0, 1, 2" for $v_1$. The rightmost table shows a dictionary, where each new label (here: 0 : 0, 1, 2) gets a fresh ID (here: 3). Algorithm 1 refers to this step as hashing the node label into a new ID (or hashcode); we use consecutive numbers just for illustrative purposes. Children need to be ordered deterministically to get the same hash for identical subtrees. The new label $l_1(v_1) = 3$ thus encodes a subtree of depth 1 with root 0 and children 0, 1, 2. Once all new labels are determined (lower half of Fig. 1) the nodes $v_1$ and $v_8$ still have the same label: $l_1(v_1) = 3 = l_1(v_8)$, because their subtrees of depth 1 are identical. After another WLSK iteration, however, the subtrees of depth 2 are no longer identical for $v_1$ and $v_8$, so their $l_2$-labels are no longer the same: $l_2(v_1) = 11 \neq 17 = l_2(v_8)$.

The final vector representation of a graph (after 2 iterations) consists of counts for each label (from all depths), e.g. $\Phi(G_1) = (4, 1, 2, 1, 1, 1, 1, 2, 1, 0, 0, 1, 1, 1, 1, 2, 1, 0, 0, 0, 0)$. The vector representation $\Phi(G)$ enables us to construct a kernel matrix or apply standard clustering and classification directly.
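The relabelling and counting steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' Algorithm 1: the function names `wl_relabel` and `wl_features`, the dictionary-based graph encoding, and the two toy graphs are assumptions made for this example (they are not the graphs of Fig. 1), and consecutive integers stand in for hashcodes as in the text.

```python
from collections import Counter


def wl_relabel(successors, labels, table, next_id):
    """One relabelling step for a single graph.

    successors: dict node -> list of successor nodes
    labels:     dict node -> current integer label
    table:      dict (root label, sorted successor labels) -> fresh ID,
                shared by all graphs so identical subtrees get identical IDs
    next_id:    first unused ID
    """
    new_labels = {}
    for v in successors:
        # Sorting the successor labels makes the key independent of child order.
        key = (labels[v], tuple(sorted(labels[c] for c in successors[v])))
        if key not in table:
            table[key] = next_id
            next_id += 1
        new_labels[v] = table[key]
    return new_labels, next_id


def wl_features(graphs, iterations):
    """graphs: list of (successors, initial_labels); returns one Counter per graph."""
    current = [dict(labels) for _, labels in graphs]
    counts = [Counter(labels.values()) for labels in current]
    next_id = 1 + max(label for labels in current for label in labels.values())
    for _ in range(iterations):
        table = {}  # one dictionary per depth, shared across graphs
        for g, (successors, _) in enumerate(graphs):
            current[g], next_id = wl_relabel(successors, current[g], table, next_id)
            counts[g].update(current[g].values())
    return counts


# Hypothetical toy graphs: graph_b extends graph_a by one extra leaf below "c".
graph_a = ({"a": ["b", "c"], "b": [], "c": []}, {"a": 0, "b": 1, "c": 2})
graph_b = ({"a": ["b", "c"], "b": [], "c": ["d"], "d": []},
           {"a": 0, "b": 1, "c": 2, "d": 0})
counts_a, counts_b = wl_features([graph_a, graph_b], iterations=2)
```

In this toy run both root nodes receive the same depth-1 ID (their depth-1 subtrees agree) but different depth-2 IDs, which mirrors the behaviour of $v_1$ and $v_8$ described above. Missing labels count as zero when a `Counter` is read, so the counters can be turned into aligned vectors by iterating over all IDs.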

Discussion
Measuring graph similarity indirectly is in general more efficient than direct approaches. Among the kernel approaches it has been pointed out that with some substructures, e.g. short paths (aka walks), many different graphs are mapped to the same point in the feature space (cf. [7]). Subtree kernels (and in particular WLSK) have been reported to be efficient and well-performing in subsequent tasks (e.g. SVM classification). However, from the example in Fig. 1 we can also acknowledge the critique of the approach: although $G_2$ has been obtained from $G_1$ by removing $v_4$ and adding $v_{12}$ only, the vector representations are very different. Spotting differences early is good when checking for isomorphic graphs, but may be less desirable for similarity assessment (e.g. clustering). Despite the few changes, more than half of the labels occur exclusively in only one of the graphs (13 of the 21 entries are zero in one of the two graphs). Continuous (rather than integer) features may help, as provided by some deep learning approaches, but deep learning requires a huge amount of training data, which makes it unsuitable for datasets of moderate size.

Enriching WL Subtree Kernels
Revisiting Fig. 1, node $v_3$ of $G_1$ and node $v_{10}$ of $G_2$ differ only by a missing node labelled '1'. From the different $l_1$-hashcodes for both nodes (5 for $v_3$ and 3 for $v_{10}$) we cannot conclude what they have in common. Secondly, node $v_2$ of $G_1$ and $v_9$ of $G_2$ are similar in the sense that nodes labelled 0 and 2 can be reached, only in $G_1$ there is an intermediate node $v_4$. If we accept that the node pairs $(v_2, v_9)$ and $(v_3, v_{10})$ are somewhat similar, this should then positively affect the $l_2$-similarity of $v_1$ and $v_8$, too. We want to take this kind of similarity into account without sacrificing the efficiency of WLSK. Instead of integer features (subtree counts) we introduce continuous features to better reflect a partial matching of subtrees. We stick to the WLSK construction, but propose a post-processing step, which replaces the zero entries in the vector representation. As many subtrees (with different hashcodes) are in fact similar, we obtain highly correlating dimensions, which are safe to remove and whose removal reduces the dimensionality. We optionally apply dimensionality reduction to arrive at a vector of moderate size.

Subtree Similarity
Given a graph $G = (V, E)$, let $L_i = l_i(V)$ be the set of all hashcodes for subtrees of depth $i$ (cf. tables on the right of Fig. 1). The hashcodes compress the newly constructed node labels, but no longer contain any information about the subtree. So we track this information in tables: for all occurring hashcodes $h \in L_i$, we denote the root node label by $r_h \in L_{i-1}$ and the multiset of successor labels by $S_h$ (with elements from $L_{i-1}$).

Next we define a series of distance functions $d_i : L_i \times L_i \to \mathbb{R}$ to capture the distance between subtree hashcodes of the same depth $i$. We start with a distance $d_0$ for the original graph node labels. In absence of any background knowledge we use

$$d_0(h, h') = \begin{cases} 0 & \text{if } h = h' \\ 1 & \text{otherwise} \end{cases}$$

for the initial level but generally assume that some background information can be provided to arrive at meaningful distances for the initial node labels. For non-trivial subtrees (that is, $i > 0$) we recursively define distance functions $d_i(h, h')$. It is natural to define the distance as the sum of distances between root and child nodes. This requires assigning child nodes of $h$ uniquely to child nodes of $h'$, which is provided by a bijective function $f : S_h \to S_{h'}$:

$$d_i(h, h') = d_{i-1}(r_h, r_{h'}) + \min_{f \in B(S_h, S_{h'})} \sum_{s \in S_h} d_{i-1}(s, f(s)) \tag{2}$$

Here $B(S, T)$ denotes the set of bijective functions $f : S \to T$. The first term measures the distance between the root node labels and the second term identifies the minimal distance among all node assignments. Finding the assignment with minimal distance is known as the assignment problem, which has well-known solutions; we adopt the Munkres algorithm for this task [4].
We are likely to deal with unbalanced assignments, that is, different numbers of children for $h$ and $h'$. A bijective assignment requires $|S_h| = |S_{h'}|$, so we add the necessary number of missing nodes (denoted by $\bot$) to the smaller multiset. We extend the distance $d_0$ to the case of missing nodes, which corresponds to an additional row/column in the $d_0$-matrix (see the $d_0$ example matrix in Fig. 2 (left)). Again, these $\bot$-distances may be an arbitrary constant or specifically provided for each label $h \in L_0$ using background knowledge. Then Eq. (2) extends naturally to $\bot$-values:

$$d_i(h, \bot) = d_{i-1}(r_h, \bot) + \sum_{s \in S_h} d_{i-1}(s, \bot)$$

Figure 2 shows an example. The leftmost table shows the $d_0$-distances between original node labels (cf. Fig. 1: $L_0 = \{0, 1, 2\}$), including the case of a missing label $\bot$. For the sake of illustration we assume a distance of $\frac{1}{2}$ for the label pair $(0, 2)$. Consider the comparison of $v_2$ and $v_9$ for depth-1 subtrees, with $h = l_1(v_2)$ and $h' = l_1(v_9)$. Both root nodes are identical ($r_h = r_{h'} = 0$), but the multisets of successors are not ($S_h = \{0, 0\}$, $S_{h'} = \{0, 2\}$). Matrix (i) shows the distance matrix for the assignment problem: all nodes of $h$ (rows) have to be assigned to a node of $h'$ (columns). As the child nodes represent $l_0$-hashcodes, we take the distances from the $d_0$ table. An optimal assignment is marked in red and we obtain a distance $d_1(h, h') = 0 + (0 + \frac{1}{2}) = \frac{1}{2}$. Matrix (ii) shows a second example for the $d_1$ comparison of $v_3$ vs $v_{10}$: as $v_{10}$ has three children but $v_3$ only two, we introduce one $\bot$-element to obtain a square matrix. The optimal assignment is shown in red, the $d_1$-distance becomes 1.0. Both examples contribute two values to the $d_1$-distance (fourth matrix), from which we may then calculate, e.g., $d_2(l_2(v_1), l_2(v_8)) = 0 + (\frac{1}{2} + 1 + 0) = 1.5$ (matrix (iii)).
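The recursive subtree distance with padded assignments can be illustrated by the following Python sketch. Several points are assumptions for illustration only: the assignment problem is solved by brute force over all permutations instead of the Munkres algorithm named in the text (adequate for the small node degrees of the example; a library routine such as SciPy's `linear_sum_assignment` would be the practical choice), a missing subtree is treated as a $\bot$ root without successors, the $d_0$ table follows the illustrative values of the example (distance $\frac{1}{2}$ for the pair (0, 2), distance 1 to $\bot$), and hashcode 4 for the subtree of $v_2$ is a hypothetical ID.

```python
from itertools import permutations

BOTTOM = None  # stands for the missing node (written as ⊥ in the text)


def d0(a, b):
    """Level-0 distance; toy values following the illustrative example."""
    if a == b:
        return 0.0
    if {a, b} == {0, 2}:
        return 0.5
    return 1.0  # all other label pairs, including any label vs. BOTTOM


def assignment_cost(S, T, dist):
    """Minimal summed distance over all bijections between the multisets S and T.

    The smaller multiset is padded with BOTTOM. Brute force, so only suitable
    for small node degrees (a stand-in for the Munkres algorithm).
    """
    S, T = list(S), list(T)
    size = max(len(S), len(T))
    S += [BOTTOM] * (size - len(S))
    T += [BOTTOM] * (size - len(T))
    return min(sum(dist(s, t) for s, t in zip(S, perm)) for perm in permutations(T))


def subtree_distance(h, h2, tables, depth):
    """Distance between two hashcodes of the same depth, following Eq. (2).

    tables[i][h] = (root label r_h, tuple of successor labels S_h) for depth i.
    """
    if depth == 0:
        return d0(h, h2)
    r1, s1 = (BOTTOM, ()) if h is BOTTOM else tables[depth][h]
    r2, s2 = (BOTTOM, ()) if h2 is BOTTOM else tables[depth][h2]

    def lower(a, b):
        return subtree_distance(a, b, tables, depth - 1)

    return lower(r1, r2) + assignment_cost(s1, s2, lower)


# Depth-1 table: 3 = "0 : 0, 1, 2" (v10), 5 = "0 : 0, 2" (v3, also the subtree
# of v9), 4 = "0 : 0, 0" (v2; hypothetical ID).
tables = {1: {3: (0, (0, 1, 2)), 4: (0, (0, 0)), 5: (0, (0, 2))}}
dist_v2_v9 = subtree_distance(4, 5, tables, 1)
dist_v3_v10 = subtree_distance(5, 3, tables, 1)
```

With these toy tables the two calls reproduce the depth-1 distances stated in the example ($\frac{1}{2}$ for $v_2$ vs $v_9$, 1.0 for $v_3$ vs $v_{10}$). Because distances are computed from the tables alone, the graphs themselves are never revisited.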

Updating Vector Representations
Once the WLSK algorithm has been executed, we determine all $d_i$-distances from the $l_i$-labels alone (without revisiting the graphs). Then we update the vector representations of all graphs, the zero entries in particular. Suppose $x$ is a vector representation of $G$ and $x_h = 0$ for some $h \in L_i$, which means that subtree $h$ is not present in $G$. Among the subtrees that do occur in $G$ we can now find the one most similar to $h$, say $h' \in L_i$ (smallest distance $d_i(h, h')$), and replace $x_h$ by

$$x_h = k(d_i(h, h')) \cdot x_{h'}$$

where $k : \mathbb{R}_+ \to [0, 1]$ is a monotonically decreasing function that turns distances into similarities with $k(0) = 1$. The multiplication with $x_{h'}$ accounts for the fact that $h'$ may occur multiple times in $G$. We used $k(d) = e^{-(d/\delta)^2}$, where $\delta$ is a user-defined threshold.
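A minimal sketch of this enrichment step is given below. The function names, the dictionary-based vector (label to count) and the toy distance table are assumptions made for illustration; only the kernel $k(d) = e^{-(d/\delta)^2}$ and the rule "replace a zero entry by the kernel value of the nearest present subtree of the same depth, times that subtree's count" are taken from the text.

```python
import math


def kernel(d, delta):
    """Turns a distance into a similarity in [0, 1] with kernel(0) == 1."""
    return math.exp(-((d / delta) ** 2))


def enrich(x, levels, dist, delta):
    """Fill the zero entries of one graph vector.

    x:      dict label -> count (labels absent from the dict count as zero)
    levels: levels[i] is the list of all labels of depth i (over all graphs)
    dist:   function (i, h, h2) -> distance between two labels of depth i
    """
    enriched = dict(x)
    for i, level in enumerate(levels):
        present = [h for h in level if x.get(h, 0) > 0]
        if not present:
            continue  # nothing of this depth occurs in the graph
        for h in level:
            if x.get(h, 0) == 0:
                # Closest subtree of the same depth that does occur in the graph.
                nearest = min(present, key=lambda p: dist(i, h, p))
                enriched[h] = kernel(dist(i, h, nearest), delta) * x[nearest]
    return enriched


# Hypothetical depth-1 distances between the labels 3, 4 and 5.
toy = {frozenset((3, 4)): 1.5, frozenset((3, 5)): 1.0, frozenset((4, 5)): 0.5}


def toy_dist(i, h, h2):
    return 0.0 if h == h2 else toy[frozenset((h, h2))]


x = {5: 2}  # label 5 occurs twice, labels 3 and 4 do not occur
enriched = enrich(x, [[], [3, 4, 5]], toy_dist, delta=1.0)
```

In this toy run the entry for label 4 (distance 0.5 to label 5) receives a larger value than the entry for label 3 (distance 1.0), while the original count of label 5 is unchanged. Only the original counts are used as donors, so the result does not depend on the order in which zero entries are processed.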

Compensating Superfluous Nodes
We say $v$ is a superfluous node if it is just a stopover on the way to yet another node, but does not contribute to the graph structure itself, that is, if the in- and out-degree of $v$ is 1. In Fig. 1 the node $v_4$ in $G_1$ is such a superfluous node. In some applications nodes with certain labels may occur occasionally, but do not carry any important information. Their existence or absence should therefore not affect the graph similarity too much. The discussed distance measure can cope with such differences when comparing, e.g., the subtree of $v_2$ with that of $v_9$. But if we consider $v_4$ as a superfluous intermediate node, it brings another undesired effect: it may introduce completely new subtrees which are not present in other graphs. In the example of Fig. 1 the node $v_4$ introduces subtrees with hashcodes 6 (at depth 1) and 14 (at depth 2), which are not present in $G_2$. When measuring the similarity of $G_1$ and $G_2$, such subtrees make the graphs appear less similar.
We address such cases by considering the insertion of a superfluous node in our distance calculation. Figure 3 shows the situation once more: to enrich the vector representation of $G_2$ we seek a closest match for label $h$. According to Sect. 3.1 we consider, amongst others, the node $v_9$ with label $h'$ as a candidate. With both nodes having a single child only, finding the optimal bijective assignment $f$ is trivial ($f(k) = k'$) and Eq. (2) boils down to $d_{i-1}(r_h, r_{h'}) + d_{i-1}(k, k')$. Now we additionally consider the insertion of a superfluous node $v_s$ with the same label as $v_4$, as shown in Fig. 3 (red). Note that a hashcode $l_i(v_s)$ for the newly inserted node was not necessarily generated earlier. How would the distance between the nodes $v_4$ and $v_s$ evaluate? According to (2) it is the sum of a root term and a child term. The child term consists of a single summand because both nodes have a single child only; note that it does not depend on $v_s$. Substituting the root term repeatedly by its definition eventually leads to an expression that contains the level-0 distance to the newly inserted node. This level-0 distance is 0 by construction; however, we replace it by a penalty term $d_I(l_0(v))$ to reflect the fact that we had to insert a new node. As with $d_0(\cdot, \cdot)$ we assume that $d_I(\cdot)$ can be derived meaningfully from the application context: if, for instance, nodes with a certain label $h$ are optional, we choose a low insertion distance $d_I(h)$ and may otherwise set $d_I(h) = \infty$ to prevent undesired insertions. We thus arrive at a distance $d_i^*(h, h')$ for the insertion of a superfluous node, which yields $\infty$ if the prerequisites of a superfluous node are not given and considers node insertion on both sides (inner min-term). The original distance (2) may then be replaced by $\min\{d_i(h, h'), d_i^*(h, h')\}$ to reflect the occurrence of superfluous nodes appropriately. These changes can be handled during the precalculation of the distance matrices; the vector enrichment remains unchanged.

Complexity
Enriching the vector representations requires two steps: (1) The calculation of all distance matrices $d_i$ requires calculating $\sum_i |L_i|^2$ entries. For each entry we have to solve an assignment problem, which is $O(d^2 \log d)$ where $d$ is the maximal node degree. The method is therefore unattractive for highly connected graphs. But many applications with large graphs have a bounded node degree. (2) Secondly, the vector representations $x$ of all $n$ graphs need to be enriched. This takes $O(m_z \cdot m_{nz})$ for each graph, where $m_z$ (resp. $m_{nz}$) is the number of zero (resp. non-zero) entries in $x$: for each zero entry in $x$ we have to find the most similar non-zero entry. The number of all labels from all graphs ($m = \sum_i |L_i|$) is much larger than the number of nodes in a single graph, whereas $m_{nz}$ is bounded by the number of nodes in a single graph. With $m_{nz} \ll m_z$ we may consider $m_{nz}$ as a constant (max. no. of nodes) and arrive at $O(n \cdot m)$ for the vector enrichment.

Application
We demonstrate the usefulness of the proposed modification in an application from computer science education. The increase in the number of CS students over the last years calls for tools that help lecturers to assess the stage of development of a whole group of students, rather than inspecting the solutions one by one. Our dataset consists of editing streams from the students' source code editor (for selected exercises of an introductory programming course using Java). In our preliminary evaluation we have about 30-50 such streams per task. We extract snapshots of the code whenever a student starts to edit a different code line than before. (Many snapshots thus do not represent compilable code.) The goal is to compare editing paths against each other, for instance, to identify the most common paths or outliers. We replace the textual representation of the source snapshot by a graph capturing the abstract syntax tree and the variable usage, as can be seen in the example of Fig. 4. We want to cluster the snapshots and to construct a new graph where nodes correspond to clusters (of code snapshots) and edges indicate editing paths of students. For the experiments we applied some preprocessing (e.g. variable renaming in the graph) and assigned low insertion costs to expression and declaration nodes, because students may phrase conditions quite differently. Our use case for superfluous nodes (Sect. 3.3) is code blocks ({ }), which are optional if the code within the block consists of a single statement only (e.g. the ++count in Fig. 4).

Effect on Distances
To measure the effect of the enriched kernel we have manually subdivided a set of snapshots into similar and dissimilar snapshots. In a clustering setting we want the modification to carve out clusters more clearly. We therefore compare the mean distance $\mu_w$ (and standard deviation $\sigma_w$) within the group of similar graphs against the mean distance $\mu_b$ (and standard deviation $\sigma_b$) between both groups. By the factor $f$ we denote the size of the gap between both means in multiples of the within-group standard deviation $\sigma_w$, that is, $f = \frac{\mu_b - \mu_w}{\sigma_w}$. The factor $f$ may be considered as a measure of separation between the cluster of similar graphs and the remaining graphs. From Table 1 we find that the enriched representation consistently yields higher values of $f$ than the standard vector representation.
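For concreteness, the separation factor can be computed from two lists of pairwise distances as sketched below. The function name and the toy numbers are assumptions, and so is the use of the sample standard deviation (the text does not say whether the sample or the population variant was used).

```python
from statistics import mean, stdev


def separation_factor(within, between):
    """Gap between the mean distances in multiples of the within-group deviation.

    within:  pairwise distances inside the group of similar graphs
    between: pairwise distances between the two groups
    Uses the sample standard deviation (an assumption).
    """
    return (mean(between) - mean(within)) / stdev(within)


# Hypothetical distances, not taken from Table 1.
f = separation_factor(within=[1.0, 2.0, 3.0], between=[5.0, 6.0, 7.0])
```

A larger value of `f` means that the group of similar graphs is set apart more clearly from the remaining graphs.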

Dimensionality
New node labels are introduced for every new subtree, which leads to a high-dimensional vector representation that has been identified as problematic in the literature (Sect. 2.3). Enriching the vector representation can help to overcome this problem, because labels with minor changes will receive similar (enriched) entries. For instance, a dataset with 718 code snapshot graphs generated as many as 5179 different subtree labels (depth 3). After enrichment we identified the number of attributes that might be removed because the dataset already contains a highly correlating attribute. This leads to a substantial reduction in the number of columns: depending on the Pearson correlation threshold of 0.9/0.95/0.99, as many as 77%/68%/55% of the attributes can be discarded.
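One simple way to remove highly correlating attributes is a greedy pass over the columns, sketched below. The text does not specify the selection procedure, so the greedy strategy, the use of the absolute correlation, the handling of constant columns, and all names and toy values are assumptions for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long value lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / math.sqrt(var_a * var_b)


def prune_correlated(columns, threshold):
    """Keep a column only if no already kept column correlates above threshold.

    columns: dict attribute name -> list of values (one value per graph)
    Greedy, so the result depends on the column order.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical enriched attributes: "b" is almost a multiple of "a".
columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],
    "c": [4.0, 1.0, 3.0, 2.0],
}
kept = prune_correlated(columns, threshold=0.99)
```

Here attribute `b` is dropped because it correlates almost perfectly with `a`, while `c` is kept. The pairwise comparison is quadratic in the number of attributes, which is acceptable as a one-off post-processing step for a few thousand columns.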

Code Graph Clustering
To reduce the dimensionality further, a principal component analysis (PCA) may be applied. Figure 5 shows the scatter plot of the principal components (PC) #2 against PC #1, #3 and #4 for the standard representation (top) and the enriched vectors (bottom). The colors indicate cluster memberships from a mean shift clustering over 4 principal components. Note that, by construction of the dataset, we do not expect the source code snapshots to fall apart completely into well-separated clusters, because the data represents the evolution towards a final solution and snapshots differ by incremental changes only. In the standard case the data scatters more uniformly and is less structured (left; PC1 vs PC2), while the enriched data shows two long-stretched clusters that reflect a somewhat linear code evolution for two different approaches to solve the exercise, which corresponds much better to our expectation. When taking an additional component into account (PC3), the scatterplot in the middle (PC2 vs PC3) offers a clearer structure for the enriched data (e.g. the separation of the curved red cluster at the top) than for the original data.

Figure 6 shows how the clusters are used in the context of our application. Each cluster (like those in Fig. 5, but for a different exercise) corresponds to a node in this graph. Whenever a student changes the code and thereby moves to a different cluster, a (directed) edge is inserted. The number of students who have followed a path is written next to the edge. Clusters that have only one incoming and one outgoing edge are not shown for the sake of brevity. The green color indicates the degree of unit-test fulfillment. The node labels a : b(c|d) carry information about the cluster id a, the number of students b that came across this node, and the number of students c (resp. d) who started (resp. ended) in this node.
From this example the lecturer can immediately recognize that 42 students start in cluster #1, from where most students (25) transition to cluster #2, and 10 more students reach the same cluster via cluster #4 as an intermediate step. Cluster #2 does not yet correspond to a perfect solution, but only 12 students manage to reach the green cluster #3 from cluster #2. Other clusters and edges have much smaller numbers; they cover exotic solutions or trial-and-error approaches. The graph provides a good overview of the students' performance as a group.
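The construction of such a transition graph from per-student cluster sequences can be sketched as follows. The input format (one list of cluster ids per student, in snapshot order), the function names, and the choice to count each student at most once per edge and per node are assumptions; the toy paths are not the data behind Fig. 6.

```python
from collections import Counter


def transition_graph(paths):
    """Build edge and node statistics from per-student cluster sequences.

    paths: dict student -> list of cluster ids in snapshot order
    Returns (edges, visited, started, ended) as Counters, where each student
    is counted at most once per edge and per node.
    """
    edges, visited, started, ended = Counter(), Counter(), Counter(), Counter()
    for clusters in paths.values():
        if not clusters:
            continue
        started[clusters[0]] += 1
        ended[clusters[-1]] += 1
        visited.update(set(clusters))
        # An edge is created only when the student moves to a different cluster.
        edges.update({(a, b) for a, b in zip(clusters, clusters[1:]) if a != b})
    return edges, visited, started, ended


def node_label(cluster, visited, started, ended):
    """Formats a node label as a : b(c|d)."""
    return f"{cluster}: {visited[cluster]}({started[cluster]}|{ended[cluster]})"


# Hypothetical editing paths of three students.
paths = {"s1": [1, 1, 2, 3], "s2": [1, 4, 2], "s3": [1, 2, 2]}
edges, visited, started, ended = transition_graph(paths)
label_2 = node_label(2, visited, started, ended)
```

In this toy example two students move directly from cluster 1 to cluster 2 and one takes the detour via cluster 4, which is the kind of pattern the lecturer reads off the graph.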

Conclusions
Weisfeiler-Lehman subtree kernels can be used to transform graphs into a meaningful vector representation, but suffer from high dimensionality and sparsity, such that the similarity assessment is limited. We overcome both problems by taking the subtree distances into account, which are simpler to assess than a general tree distance, because only subtrees of equal depth need to be considered. Based on the subtree distance we enrich the zero entries of graph vectors and improve the similarity assessment. A removal of highly correlating attributes reduces the dimensionality considerably. The modifications turned out to be advantageous in a use case of source code snapshot clustering.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.