1 Introduction

The era of big data has reached academic and industrial pharmaceutical drug research in the last decade and has changed how drugs are developed. Nowadays, large collections of bioactivity data and large databases of potentially synthesizable molecules exist. Publicly available bioactivity databases like ChEMBL [4] or PubChem [25] contain over 16 million data points about molecules that modulate protein or drug target functions. This allows data-driven decisions via in-depth data mining and knowledge discovery approaches, e.g., the identification of similar molecules to predict a protein target or unwanted side effects. The extraction of molecular features enables an increasingly reliable prediction of properties such as toxicity or oral bioavailability.

The chemical space of drug-like molecules provides another source of big data. Theoretical analysis estimates the comprehensive chemical space at around \(10^{62}\) molecules of typical drug size. Among those, around 166 billion molecules are described by the chemical universe database GDB-17, which was built up using 17 standard atoms that occur within drugs [47]. The REAL space, a large collection of commercially available chemical compounds, contains about 15.5 billion molecules that are potentially synthesizable. Finally, the current version of the ZINC database [55] contains over 750 million purchasable compounds that have already been synthesized.

Several established tools and workflows are available that utilize bioactivity data or the chemical space for the rational development of bioactive molecules [20 SPP]. These approaches are based on the common basic assumption that similar molecular structures have similar bioactivities. A classical approach for identifying similarity between molecules is to use molecular fingerprints, i.e., fixed-size vectorial representations of structural characteristics such as extended-connectivity fingerprints [46]. Although this form of representation allows fast comparisons and the usage of fast vector-based tools, vectorization suffers from information loss and can lead to inaccurate discrimination of similar molecules. This becomes a problem for the 'big' molecular databases described above, since the available similarity measures do not discriminate enough. Thus, similarity searches in such databases yield too many false positives, which hampers further processing.
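The fingerprint comparison described above can be sketched in a few lines. The following snippet computes the widely used Tanimoto coefficient on binary fingerprints, modeled here simply as Python integers; the bit patterns are made up for illustration and far shorter than real fingerprints, which are produced by cheminformatics toolkits.

```python
# Sketch: Tanimoto similarity between two fixed-size binary fingerprints.
# Fingerprints are modeled as plain integers (bit vectors); real fingerprints
# (e.g., extended-connectivity fingerprints) come from chemistry toolkits.

def tanimoto(fp_a: int, fp_b: int) -> float:
    """Shared bits divided by bits set in either fingerprint."""
    inter = bin(fp_a & fp_b).count("1")
    union = bin(fp_a | fp_b).count("1")
    return inter / union if union else 1.0

# Two hypothetical 16-bit fingerprints differing in a few positions:
fp1 = 0b1011001011000001
fp2 = 0b1011001001000011
print(tanimoto(fp1, fp2))  # 0.75
```

Because such vectors compress the structure, two distinct molecules can share a fingerprint, which is exactly the information loss discussed above.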

A more accurate comparison of molecules is directly based on the graph representation of the chemical structures. This representation allows the modeling of molecules as graphs with attributes and the use of graph-theoretic concepts, algorithms, and tools to analyze molecular databases. Figure 1 shows two similar molecules and their graph representation. The atoms are modeled as vertices and the bonds between the atoms as edges. Attributes could be, e.g., a label for each vertex providing the atom type and a label for each edge encoding the bond type.
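Such a labeled molecular graph can be represented with a minimal data structure. The sketch below encodes vertices with element labels and edges with bond-type labels; the example molecule (ethanol, hydrogens suppressed) and the class layout are purely illustrative.

```python
# Minimal sketch of an undirected labeled graph G = (V, E, l) for a molecule.
# Vertex labels are element symbols, edge labels are bond types.

class LabeledGraph:
    def __init__(self):
        self.vertex_labels = {}       # vertex id -> label
        self.edge_labels = {}         # frozenset({u, v}) -> label

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        assert u != v and u in self.vertex_labels and v in self.vertex_labels
        self.edge_labels[frozenset((u, v))] = label

    def neighbors(self, v):
        return [next(iter(e - {v})) for e in self.edge_labels if v in e]

ethanol = LabeledGraph()
for v, elem in enumerate("CCO"):      # two carbons and one oxygen
    ethanol.add_vertex(v, elem)
ethanol.add_edge(0, 1, "single")
ethanol.add_edge(1, 2, "single")
print(ethanol.neighbors(1))           # [0, 2]: the middle carbon bonds both atoms
```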

Fig. 1. Two similar molecules (Sildenafil and Vardenafil) and their corresponding graphs. The colors display the atom types (nodes) and bond types (edges).

Unfortunately, comparing two molecular graphs based on the concept of isomorphism is notoriously more time-consuming than molecular fingerprint-based similarity search. Additionally, a comparison based on the maximum common substructure (maximum common subgraph) of two molecules may fail to identify molecules with similar chemical properties, since the classical definition of a common substructure is too strict under some circumstances. Therefore, novel methods are urgently needed for the analysis of the still increasing amount of molecular data. The focus of the interdisciplinary project “Graph-based Methods for Rational Drug Design” has been the development of new structural approaches w.r.t. molecular similarity search and molecular clustering. This chapter presents some of the main results and puts them into the wider context of graph similarity.

Preliminaries and mathematical definitions are provided in Sect. 2. State-of-the-art methods for comparing graphs w.r.t. the size of their Maximum Common Subgraph (MCS) in the context of molecular graphs are discussed in Sect. 3. For drug design, it is often advisable to preserve certain molecular substructures, such as rings, blocks, or bridges, in comparisons since they have special biochemical properties as a whole. This method for comparing molecules can be further improved by incorporating chemical knowledge about reasonable atom or substructure substitutions that presumably do not affect bioactivity considerably. Figure 1 shows an example of two drugs with atom substitutions in the bicyclic structure that do not affect bioactivity. In our model of similarity, it is allowed to change certain structures of the graphs and still mark them as structurally equivalent. To this end, we have introduced the Maximum Similar Subgraph (MSS) problem. Our findings, including algorithms and experimental results, are discussed in Sect. 3.2.

Clustering analysis is used for a variety of tasks in drug discovery. This includes complexity reduction, structure-activity relationship reasoning in visual analytics, novelty analysis of de novo databases (see Sect. 5.3), diversity analysis, structured sampling, and many more. Cluster analysis on huge molecular databases is the topic of Sect. 4. First, we discuss computational and information-theoretic challenges before we present a scalable state-of-the-art structural clustering algorithm (StruClus) that tackles these challenges. In Sect. 5, we discuss selected successful applications in rational drug design in the context of this priority program. In a scaffold-focused analysis of bioactivity data, we discovered an unexpected similarity in ligand binding between two important drug targets (BRD4 and PPAR\(\gamma \)) in cancer therapy (cf. Sect. 5.2). This discovery was made possible by Scaffold Hunter, an open-source tool developed in our group to support the drug discovery process (cf. Sect. 5.1). In Sect. 5.3, we present CHIPMUNK, a new virtual database of more than 95 million synthesizable small molecules. Using StruClus, it was possible to demonstrate the novelty of the database in comparison to existing molecular libraries.

2 Preliminaries

An undirected labeled graph \(G=(V,E,l)\) consists of a finite set of vertices \(V(G)=V\), a finite set of edges \(E(G)=E\) and a labeling function \(l:V\biguplus E \rightarrow L\), where L is a finite set of labels. An edge \(\{u,v\}\) connects two vertices \(u,v\in V\), \(u\not =v\). A (simple) path of length n is a sequence of vertices \((v_0, \dots , v_n)\) such that \(\{v_i,v_{i+1}\} \in E\) for \(i=0,\ldots ,n-1\) and \(v_i \not = v_j\) for \(i \not = j\), \(i,j=0,\ldots ,n\). A tree is a graph in which any two vertices are connected by a unique path. A graph is called planar if it admits a drawing in the plane without edge crossings, and it is outerplanar if such a drawing is possible in which every vertex lies on the boundary of the outer face.

For our similarity approaches based on subgraph isomorphisms, we need the following definitions. Let G and H be two undirected labeled graphs. A (label preserving) subgraph isomorphism from G to H is an injection \(\psi :V(G) \rightarrow V(H)\), where \(\forall v \in V(G): l(v) = l(\psi (v))\) and \(\forall u,v \in V(G) : \{u,v\} \in E(G) \Rightarrow \{\psi (u),\psi (v)\} \in E(H) \wedge l(\{u,v\}) = l(\{\psi (u),\psi (v)\})\). If there exists a subgraph isomorphism from G to H, we say H supports G, G is a subgraph of H, H is a supergraph of G or write \(G \subseteq H\). If additionally \(\{u,v\}\in E(G)\Leftarrow \{\psi (u),\psi (v)\}\in E(H)\) for all \(u,v \in V(G)\), then \(\psi \) is an induced subgraph isomorphism. If there exists a subgraph isomorphism from G to H and from H to G, the two graphs are isomorphic. A common subgraph (cf. Fig. 2) of G and H is a graph C that is subgraph isomorphic to G and H. A maximum common subgraph (MCS) is a common subgraph of maximum size (vertices plus edges).
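The label-preserving subgraph isomorphism defined above can be tested with a naive backtracking search. The sketch below is exponential in the worst case (practical matchers use refined algorithms such as VF2); the graph encoding as a pair of label dictionaries is our own illustrative convention.

```python
# Naive label-preserving subgraph isomorphism test. A graph is a pair
# (vertex_labels, edge_labels) with edge_labels keyed by frozenset pairs.

def subgraph_isomorphic(G, H):
    gv, ge = G
    hv, he = H
    g_vertices = list(gv)

    def extend(mapping):
        if len(mapping) == len(g_vertices):
            return True                      # every vertex of G is mapped
        v = g_vertices[len(mapping)]
        for w in hv:
            if w in mapping.values() or gv[v] != hv[w]:
                continue                     # injectivity and label preservation
            # every already-mapped G-edge incident to v must exist in H
            # with an equal edge label
            ok = all(
                frozenset((mapping[u], w)) in he
                and ge[frozenset((u, v))] == he[frozenset((mapping[u], w))]
                for u in mapping
                if frozenset((u, v)) in ge
            )
            if ok:
                mapping[v] = w
                if extend(mapping):
                    return True
                del mapping[v]               # backtrack
        return False

    return extend({})

# A C-O fragment embeds into a C-C-O chain:
G = ({0: "C", 1: "O"}, {frozenset((0, 1)): "single"})
H = ({0: "C", 1: "C", 2: "O"},
     {frozenset((0, 1)): "single", frozenset((1, 2)): "single"})
print(subgraph_isomorphic(G, H))  # True
```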

Fig. 2. Example: A common subgraph C of the graphs G and H. Dashed arrows indicate the subgraph isomorphism.

A graph \(G=(V,E)\) with \(|V| \ge 3\) is called biconnected if \(G\setminus \{v\}\) is connected for each \(v \in V\). A maximal biconnected subgraph of a graph G is called a block. An edge \(\{u,v\}\in E(G)\) not contained in any block of G is a bridge. A vertex v of G is called a cutvertex if \(G\setminus \{v\}\) consists of more connected components than G. A BC-tree \(\text {BC}^G\) of a graph G consists of a node for each block and bridge in G and all the cutvertices of G. Two nodes (blocks or bridges) \(b,b'\) in a BC-tree are connected through the path \(bcb'\) if they share the cutvertex \(c\in V(G)\). Figure 3 shows an example of a graph and its BC-tree. Let S and G be graphs, and \(\psi :V(S)\rightarrow V(G)\) be a subgraph isomorphism. Then \(\psi \) is block-and-bridge-preserving (BBP) if any two edges in different blocks of S map to different blocks of G, and each bridge in S maps to a bridge in G.
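The bridges and cutvertices that make up a BC-tree can be found with a single depth-first search using the classical low-point method. The following sketch assumes a connected simple graph given as an adjacency dictionary; building the BC-tree itself would be a further step on top of this output.

```python
# Bridges and cutvertices of a connected simple graph via one DFS
# (Hopcroft/Tarjan low-point method). adj maps each vertex to its neighbors.

def bridges_and_cutvertices(adj):
    disc, low = {}, {}
    bridges, cutvertices = set(), set()
    timer = 0

    def dfs(v, parent):
        nonlocal timer
        disc[v] = low[v] = timer
        timer += 1
        children = 0
        for w in adj[v]:
            if w == parent:
                continue
            if w in disc:                        # back edge
                low[v] = min(low[v], disc[w])
            else:                                # tree edge
                children += 1
                dfs(w, v)
                low[v] = min(low[v], low[w])
                if low[w] > disc[v]:             # no back edge over {v, w}
                    bridges.add(frozenset((v, w)))
                if parent is not None and low[w] >= disc[v]:
                    cutvertices.add(v)
        if parent is None and children > 1:      # DFS root is a special case
            cutvertices.add(v)

    dfs(next(iter(adj)), None)
    return bridges, cutvertices

# Triangle {0, 1, 2} with a pendant vertex 3: one bridge, one cutvertex.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(bridges_and_cutvertices(adj))
```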

Fig. 3. A connected graph (left side) and its BC-tree (right side). The BC-tree's block nodes are depicted as green squares, the bridge nodes as blue squares. The white filled circles are the cutvertices. The associated subgraphs of G are depicted above the blocks and bridges. (Color figure online)

The support \(\text {supp}(G, \mathcal {G})\) of a graph G over a set of graphs \(\mathcal {G}\) is the fraction of graphs in \(\mathcal {G}\) that support G. G is said to be frequent if its support is greater than or equal to a minimum support threshold \( supp _{\min }\). A frequent subgraph G is maximal if there exists no proper frequent supergraph of G.

3 Molecular Similarity Based on Graphs

An essential criterion of molecular similarity in drug design is not only the similarity in chemical structure but also the similarity in biological activity or bioactivity. In order to obtain molecular similarities meeting this requirement, we introduce a graph-based method, which addresses the following problem.

Definition 1

Given two molecular graphs G and H, the maximum similar subgraph problem is to find chemically meaningful subgraphs of G and H with equivalent bioactivity.

Starting from this informal description, we introduce clearly defined graph-theoretical problems extending the maximum common subgraph paradigm. Since scalability is a critical concern, algorithmic aspects and complexity results must be taken into account and related to the specific properties of molecular graphs. These graphs are almost always planar and often outerplanar [18]. Since the number of bonds per atom is limited, the vertex degrees are bounded. Moreover, the graphs representing small molecules have small tree width. The tree width of a graph essentially measures how similar a graph is to a tree: trees have tree width 1, and graphs that can be constructed via parallel or serial merges (series-parallel graphs) have tree width 2. Typically, molecular graphs have vertex and edge attributes that are either discrete labels or numerical values.

We proceed with a discussion of similarity approaches based on the maximum common subgraph paradigm and the specific challenges when applied to molecular graphs. Then, new graph-based methods are introduced, which address these challenges as part of the maximum similar subgraph problem.

3.1 Challenges and Approaches in Comparing Molecular Graphs

The maximum common subgraph problem is to find a common subgraph of maximum size in two given graphs. In the domain of cheminformatics, the maximum common subgraph problem has been extensively studied [12, 44, 50]; see [28 SPP] for a recent survey. In this domain, it is often referred to as the maximum or largest common substructure problem. The problem is known to be \(\mathcal{N}\mathcal{P}\)-hard. With trees as input and output, it was shown to be polynomial-time solvable [35], but bioactive molecular graphs are not trees in general. The fact that they are mostly outerplanar does not directly lead to efficient algorithms, since the maximum common subgraph problem restricted to outerplanar graphs remains \(\mathcal{N}\mathcal{P}\)-hard. Instead of developing maximum common subgraph algorithms for more general graph classes, which has proven difficult, a different approach represents molecules in simplified form as trees [41]. Then, vertices typically represent groups of atoms, and their comparison requires rating the similarity of two vertices by a weight function. However, similar to fingerprints, this goes along with a loss of information. Especially when comparing against large molecular databases, e.g., to rank the molecules regarding their similarity, this loss of information can lead to reduced distinctiveness [21 SPP].

For molecular graphs, there is a variation of the maximum common subgraph problem of high practical relevance. There, the block (i.e., connected set of molecular rings) and bridge (i.e., molecular chain) structure of the input graphs must be retained by the common subgraph, i.e., the underlying subgraph isomorphism is block-and-bridge preserving (BBP). This variation is called the block-and-bridge preserving maximum common subgraph problem (BMCS) and requires the common subgraph to be connected and the associated subgraph isomorphisms to be BBP. There is a variant of the problem where the subgraphs are not necessarily (vertex) induced. This edge-induced variant is denoted as BMCES. Both variants have been shown to yield meaningful results for cheminformatics and to be computable in polynomial time on outerplanar graphs [50, 21 SPP, 10 SPP].

In [50], a BMCES algorithm was proposed for outerplanar molecular graphs. Contrary to the original claim of \(\mathcal {O}(n^{2.5})\) for a graph with n vertices, the algorithm allows no better bound than \(\mathcal {O}(n^4)\) on its running time [30 SPP]. A previously suggested algorithm for the BMCS problem on input graphs of tree width \(k\le 2\) has a running time of \(\mathcal {O}(n^6)\) [32 SPP]. In the case of outerplanar input graphs, the running time can be reduced to \(\mathcal {O}(n^5)\). An essential part of this algorithm is the decomposition of the graphs into their BC- and SPQR-trees, which split the graphs into their biconnected and triconnected components. A maximum solution is then computed via dynamic programming on the blocks and bridges.

Following the above result, we presented a faster approach tailored to outerplanar graphs [10 SPP]. On such graphs G and H, this algorithm achieves a running time of \(\mathcal {O}(|G|\,|H|\,\varDelta (G,H))\), where \(\varDelta (G,H)=1\) if G or H is biconnected, and \(\varDelta (G,H)=\min \{\varDelta _C(G),\varDelta _C(H)\}\) otherwise, with \(\varDelta _C(G)\) and \(\varDelta _C(H)\) denoting the maximum degree over all cutvertices in G and H, respectively. For outerplanar molecular graphs, the time bound is \(\mathcal {O}(|G|\,|H|)\) since they have bounded degree. The first major ingredient is a fast dynamic programming approach on the BC-trees of the input graphs, where we exploit the similarity between the maximum weight matching instances that we have to solve [9 SPP]. Here, we use an algorithm for the maximum weight matching problem with a running time depending on the smaller vertex set. The second ingredient is a quadratic time algorithm to find a biconnected maximum common subgraph between two blocks \(b_1\) and \(b_2\). This is realized by enumerating all maximal (with respect to inclusion) biconnected common subgraphs between the two blocks. Each maximal solution C can be computed in time \(\mathcal {O}(|C|)\). The total size of all maximal solutions per block pair \((b_1,b_2)\) is \(\mathcal {O}(|b_1|\,|b_2|)\); hence the total algorithm's running time is \(\mathcal {O}(|G|\,|H|)\). Along the edges and vertices with different labels, the maximal solutions are split into smaller biconnected components. Among all those components, we keep one of maximum size.

For non-outerplanar graphs, we use a clique reduction to compute biconnected maximum common subgraphs between two blocks if at least one of them is not outerplanar. In the reduction, we enumerate c-cliques as presented in [8, 27]. Among them, we keep a biconnected c-clique of maximum size. This approach reduces the practical running time compared to a pure clique-based algorithm operating on the whole graphs, since the computationally demanding clique problem must be solved for small components only. In contrast to the BMCES algorithm of [50], the above-described technique enables our algorithm to compute a solution for any two molecular graphs and lowers the practical running time for graphs with multiple blocks, even if they are not outerplanar.

We evaluated the practical running time of our algorithm [10 SPP] by comparing it to the BMCES algorithm from [50]. In our experiments, we used a dataset of 29 000 randomly chosen pairs of outerplanar molecular graphs from the NCI Open Database (GI50), with an average of 22 vertices (atoms) and a maximum of 104 vertices. Our algorithm outperformed the competitor by a factor of 84 on average. The experimental results align with our theoretical correction [30 SPP] of the running time analysis given in [50]. It should be noted that the BMCES algorithm is already much faster than a general clique-based MCS algorithm [50]. Our BMCS algorithm outperforms such a general algorithm by several orders of magnitude. The practical difference between the results of the vertex- and edge-induced variants is marginal; we observed a disagreement in only 0.4% of the comparisons.

While our basic BMCS algorithm is fast in theory and practice, the primary goal is to find a meaningful common subgraph. It has been observed that allowing disconnected common subgraphs improves the validity, given that the connected components are arranged consistently in both graphs [34, 51]. However, solving the general disconnected variant is \(\mathcal{N}\mathcal{P}\)-hard even on trees. Moreover, small variations of the chemical elements (vertex labels) might be tolerated. We tackle these challenges in the next subsection.

3.2 Maximum Similar Subgraph Based Similarities for Molecules

This subsection presents several problem settings where the classical MCS definition is too strict w.r.t. molecular bioactivity. We show how these settings can be approached theoretically under the MSS definition and solved in practice by integrating them into the MCS algorithms. Subsequently, we evaluate our MSS approach in comparison with several established molecular similarity measures.

From a chemical point of view, the two drugs shown in Fig. 1 are almost identical and are expected to have nearly identical properties w.r.t. bioactivity. However, an MCS-based comparison would interpret a large part of the molecules as different due to the nitrogen switch in the bicyclic ring system. In other words, the exchange of a nitrogen and a carbon atom in an aromatic ring should influence the molecular similarity only to a small extent under the maximum similar subgraph problem definition. In addition, atom types like aromatic nitrogen or carbon can be grouped by their properties, and such atom type groups can be used as representation instead. Thus, by softening the matching constraints in the MSS problem, a much larger substructure should be identified in the two molecules in comparison to the MCS approach. This problem can be solved with an atom type group representation [39] and a score in the range \([0,1]\cup \{-\infty \}\) for group mappings (mappings of vertices with atom type group labels), where \(-\infty \) forbids the mapping. Hence, the objective is to maximize the weight of all mapped groups instead of the number of mapped vertices. The complete weight matrix is listed in Table III.2.3 of [19].
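The scoring of group mappings can be illustrated as follows. The weights below are hypothetical placeholders (the actual matrix is given in Table III.2.3 of [19]), and the fconv-style group names are used only for illustration; the point is that a single forbidden pair drives the total score to \(-\infty\), while tolerated substitutions merely reduce it.

```python
# Illustration of scoring an atom-group mapping with weights in [0,1] ∪ {-inf}.
# The weight values and group names below are hypothetical placeholders.
from math import inf

weight = {
    ("C.ar", "C.ar"): 1.0,    # identical aromatic carbon groups
    ("C.ar", "N.ar"): 0.8,    # tolerated aromatic C/N exchange
    ("C.ar", "O.3"): -inf,    # forbidden mapping
}

def mapping_score(pairs):
    """Total weight of a vertex mapping given as (group_G, group_H) pairs;
    -inf as soon as any single pair is forbidden or unknown."""
    total = 0.0
    for a, b in pairs:
        w = weight.get((a, b), weight.get((b, a), -inf))  # symmetric lookup
        if w == -inf:
            return -inf
        total += w
    return total

print(mapping_score([("C.ar", "C.ar"), ("C.ar", "N.ar")]))  # 1.8
print(mapping_score([("C.ar", "O.3")]))                     # -inf
```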

Additionally, we allow the mapping of disjoint paths of bridges (more precisely, the paths' endpoints, while skipping the inner vertices) to each other [11 SPP] in our MSS approach, i.e., we allow a certain kind of disconnection. Following [17], we denote this technique embedding. This is useful, e.g., if two molecules differ only in the length of a chain connecting similar or identical substructures. To prevent arbitrarily long paths, we introduce a linear penalty depending on the length of such paths. An example of two molecular graphs that profit from the described approach is depicted in Fig. 4.

In summary, we developed an algorithm applicable to molecular graphs that addresses the maximum similar subgraph problem by (i) using the established BMCS concept, (ii) allowing disconnectivity by mapping paths to edges, and (iii) supporting weight functions between labels. Moreover, our algorithm is efficient in theory and practice for the vast majority of molecular graphs.

Fig. 4. Molecular graphs of Melphalan (top) and Chlorambucil (bottom). The BMCS on the left (red) maps fewer vertices than the BMCS embedding on the right (blue, green). The atoms on the right side (O, O, H) may be added to the embedding by mapping the green paths to each other. (Color figure online)

In order to evaluate the quality of our MSS approach, we used a setup similar to [40] and compared it to state-of-the-art chemical fingerprint methods. Our main question was whether the MSS approach produces meaningful results when used to rank molecules. In the following, we present the key evaluation results for the single-assay benchmark, which consists of rather similar molecules that have been ranked by the authors in decreasing order of activity.

First, we analyzed different layers for representing the molecules. Among them are the chemical element representation (e.g., N for nitrogen) and the file conversion (fconv) atom type groups [39]. We discovered that the latter representation, based on the weight matrix of Table III.2.3 of [19], performed best for the single-assay benchmark. As similarity coefficient, we used Bunke and Shearer's [43], which performed best among the tested ones. It is defined as \(W/\max \{k,l\}\), where W is the weight of the maximum common subgraph, and k, l are the sizes of the input graphs.
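The coefficient itself is a one-liner. The numbers in the example below are made up for illustration; note that the normalization by the larger input graph keeps the value in [0, 1].

```python
# Sketch of the Bunke-Shearer similarity coefficient W / max{k, l}, where W is
# the weight of the maximum common subgraph and k, l are the input graph
# sizes (vertices plus edges).

def bunke_shearer(w_mcs: float, size_g: int, size_h: int) -> float:
    return w_mcs / max(size_g, size_h)

# Hypothetical numbers: a common subgraph of weight 18 between graphs of
# size 21 and 24 yields a similarity of 0.75.
print(bunke_shearer(18, 21, 24))  # 0.75
```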

Compared to the other methods, the very popular ECFP4 fingerprint showed the best match with the reference ranking, followed by our MSS embedding approach. Next come RDKit7 (a fingerprint of all subpaths up to path length 7), MSS without embedding, RDKit6, and the BMCS approach; other fingerprint methods ranked in between. Extended-connectivity fingerprints (ECFPs) capture the neighborhood of the non-hydrogen atoms in circular layers up to a given diameter (e.g., 4 in the case of ECFP4). Thus, their features, similarly to the MSS, also represent the presence of particular small substructures. However, the advantage of our MSS approach is that it explicitly computes the similar substructures of the molecules and a concrete mapping between the atoms (vertices). It also achieved a high distinctiveness between the results, which is important to virtually screen large (big data) molecular libraries. The additional feature of mapping disjoint paths to each other showed improved results on the ranking benchmarks. More detailed results, as well as additional tests, can be found in [19].

4 Clustering Analysis

As mentioned in the introduction, clustering is used for a variety of use cases in drug discovery. In the following, we focus on the task of clustering large-scale molecular datasets of labeled graphs. An application of the presented approach is given in Sect. 5.3.

Definition 2

A clustering of a graph dataset (i.e., a multiset of labeled graphs) \(\mathcal {X}\) is a partition \(\mathcal {C}= \{C_1, \dots , C_n\}\) of \(\mathcal {X}\) that maximizes cluster homogeneity and, often, separation.

The concrete definitions of homogeneity and separation differ between clustering methods. Common measures for homogeneity are diameters or radii, density, or relative closeness to cluster representatives. Separation is often defined over the minimum distance between cluster elements or some aggregated cluster features. In contrast to homogeneity, separation is not always considered by clustering algorithms. For example, it is challenging to find a suitable definition of separation for projected clustering algorithms, since each cluster is linked to its own subspace and is therefore incomparable to other clusters. Meta algorithms can be used to tune clustering algorithms that do not optimize separation directly in order to achieve well-separated clusterings. For example, the number of clusters can be used as such a tuning parameter for the classical k-means algorithm [56].

4.1 Challenges and Approaches in Molecular Cluster Analysis

A major design decision for clustering algorithms is the data representation. Most classical clustering algorithms rely on vectorial data interpreted as points in some predefined space (e.g., \(\mathbb {R}^n\) with the \(l^2\)-norm) or, more generally, on pairwise distances or kernels. Exchangeable distances or kernels are very versatile since they allow the clustering algorithm to be adapted to the specific clustering task. However, an explicit vector space representation with a fixed norm is often beneficial in terms of computational complexity. For example, it allows the explicit calculation of centroids, easy extraction of subspaces, or the use of binning. With these methods, it is often possible to avoid calculating a quadratic number of pairwise distances during the clustering process.

To fit a graph dataset into these models, the graphs must either be transformed into vectors (e.g., by using structural fingerprints or Weisfeiler-Lehman features [38]) or kernels/distances must operate directly on the graph data (e.g., graph kernels [31] or the distance given in Sect. 3). However, while preferable in terms of generality, these generalized methods have weaknesses in the discussed domain.
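To make the vectorization route concrete, the sketch below performs one round of Weisfeiler-Lehman label refinement on a small labeled graph and histograms the refined labels; repeating such rounds and concatenating the histograms yields WL feature vectors of the kind cited above. The graph encoding is our own illustrative convention.

```python
# One Weisfeiler-Lehman refinement round producing count features for a
# labeled graph. adj maps vertices to neighbor lists, labels maps vertices
# to their current labels.
from collections import Counter

def wl_round(adj, labels):
    """Refine each vertex label with the sorted multiset of neighbor labels."""
    return {
        v: (labels[v], tuple(sorted(labels[w] for w in adj[v])))
        for v in adj
    }

adj = {0: [1], 1: [0, 2], 2: [1]}       # a small C-N-C chain
labels = {0: "C", 1: "N", 2: "C"}
refined = wl_round(adj, labels)
features = Counter(refined.values())     # histogram = one block of the vector
print(features)
```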

First, both methods tend to produce (intrinsically) high-dimensional datasets [29]. While high dimensionality may even be beneficial in supervised learning, intrinsically high-dimensional datasets are linked to the so-called concentration effect [5] in the unsupervised setting. This effect causes the pairwise distances to lose their relative contrast, i.e., the distances converge towards a common value. The concentration effect is closely related to poor clusterability [1, 57]. Furthermore, it causes metric index structures to become inefficient. Subspace or projected clustering methods, which are usually used in such a setting, come with an extra computational burden and are usually limited to vector spaces.

Second, the transformation to reasonably sized vectors is lossy and non-reversible. This makes the clustering results hard to interpret, since cluster features, centroids, or subspaces are not in the application domain. Thus, these methods fail to provide a domain-specific explanation of cluster commonalities.

As a consequence of these issues, structural clustering methods have been developed that provide cluster descriptions or interpretations directly in the graph domain. This is accomplished by various constructs, including subgraph isomorphisms, (maximum) common subgraphs [6], frequent subgraphs [57], graph edit operations [23], and set medians [14, 23]. For example, a cluster description can be given in the form of common subgraphs. Since most of these sub-problems are themselves challenging \(\mathcal{N}\mathcal{P}\)-hard problems, structural clustering algorithms are often limited to small datasets (e.g., [14, 23, 58]) or very special graph classes (e.g., trees [3]). As a consequence of the computational complexity, some of these clustering algorithms are hybrid approaches, which utilize approximations in vector space in order to map the results back into the graph domain. For example, the clustering algorithms in [14, 23] calculate a cluster median in vector space but assign graphs to clusters w.r.t. the graph edit distance. A hierarchical k-means clustering in vector space is used as a starting point in [6] and later refined in order to increase the size of the common substructures. To the best of our knowledge and besides our own work, the only structural clustering algorithm for larger-scale datasets of general labeled graphs is presented in [54]. In this algorithm, each partition element of a vectorial pre-clustering is further partitioned with a structural algorithm. The pre-clustering is designed to separate only graphs that, with high probability, would also be separated if the structural clustering were applied to the whole dataset.

4.2 StruClus: Scalable Structural Graph Set Clustering

StruClus [49 SPP] is a structural projected clustering algorithm tailored towards our setting of large-scale datasets (\(\gg 10^6\) graphs) of small labeled graphs (drug-like molecules are limited in their maximum size for biological reasons). Its linear runtime w.r.t. the dataset size, the usage of various sampling strategies, and a parallelizable algorithm design make StruClus scalable and very fast in practice. It incorporates homogeneity and separation constraints for high-quality results.

Fig. 5. Real-world clusters with representatives (grey boxes) generated by StruClus. Colors represent node labels. (Color figure online)

A central concept of StruClus is the usage of a cluster representative set \(\mathcal {R}(C)\) for each \(C\in \mathcal {C}\) (cf. Fig. 5 for a real-world example) that contains frequent subgraphs of the cluster members. Representatives are beneficial in terms of computational complexity since they enable graph-cluster comparisons without looking at the cluster members (similar to the concepts of centroids or medoids). Additionally, they lead to human-interpretable clusters by explaining the cluster content in the application domain.

The main objective of StruClus is to maximize homogeneity in the sense that a large fraction of the nodes and edges of the cluster members is covered by some subgraph isomorphism from the representatives. Similar to the classical k-means algorithm, this is achieved by an iterative optimization procedure that updates the representatives and re-assigns the cluster members to the best-fitting cluster. However, the number of clusters is not pre-defined but adapted to the dataset structure with the help of cluster splitting operations on inhomogeneous clusters. Additionally, clusters with similar representatives are merged in order to maintain a well-separated clustering.

Fig. 6. Example of a meet-semilattice of subgraphs ordered by the subgraph isomorphism relation. Node colors indicate labels. Maximal frequent subgraphs are marked with a blue background color. (Color figure online)

Performance-wise, the major challenge lies in the discovery of suitable representatives \(\mathcal {R}(C)\) for each cluster C. Since the number of frequent subgraphs may be exponential w.r.t. the maximal graph size in the cluster, StruClus utilizes a randomized maximal frequent subgraph sampling method. This is implemented by a random exploration of the frequent subgraphs of each \(C\), which form a meet-semilattice with the partial order derived from the sub- and supergraph relation (cf. Fig. 6). Each random exploration starts with the empty graph and moves up in the lattice until a maximal frequent subgraph is reached. Since the support is monotonically decreasing w.r.t. the supergraph relation, it is possible to prune the search space with the minimum support threshold \( supp _{\min }\).
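The lattice exploration can be illustrated on a simplified pattern domain. The sketch below replaces subgraphs by itemsets, so the pattern lattice is the subset lattice and support testing is trivial, but the principle is the same: a random upward walk from the empty pattern, pruned by the anti-monotone support threshold, ends in a maximal frequent pattern.

```python
# Simplified sketch of randomized maximal frequent pattern sampling with
# itemsets instead of subgraphs (the subset lattice stands in for the
# subgraph meet-semilattice).
import random

def support(pattern, dataset):
    """Fraction of records containing the pattern."""
    return sum(pattern <= record for record in dataset) / len(dataset)

def sample_maximal_frequent(dataset, items, supp_min, rng):
    pattern = frozenset()
    while True:
        # Anti-monotonicity: only frequent extensions can lead upwards.
        extensions = [
            pattern | {i} for i in items - pattern
            if support(pattern | {i}, dataset) >= supp_min
        ]
        if not extensions:
            return pattern      # maximal: no frequent superpattern exists
        pattern = rng.choice(extensions)

dataset = [frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("cd")]
rng = random.Random(0)
print(sample_maximal_frequent(dataset, frozenset("abcd"), supp_min=0.5, rng=rng))
```

For this toy dataset, the walk always ends in one of the maximal frequent patterns {a, b}, {c}, or {d}, depending on the random choices.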

The above-described maximal frequent pattern sampling is complemented with a new error-bounded stochastic sampling strategy over the cluster members to determine whether a graph pattern is frequent. A subset of the maximal frequent subgraphs given by this twofold sampling procedure is then selected by ranking the frequent subgraphs w.r.t. the above homogeneity criterion.
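Support estimation on a member sample admits a standard Hoeffding-style guarantee: with probability at least \(1-\delta\), the estimate deviates from the true support by at most \(\epsilon\). Whether StruClus uses exactly this bound is an assumption here; the sketch only illustrates the general idea that the required sample size is independent of the cluster size.

```python
# Support estimation on a random member sample with a Hoeffding-style bound:
# n >= ln(2/delta) / (2 * epsilon^2) samples suffice for a (epsilon, delta)
# guarantee, regardless of how large the cluster is.
import math
import random

def hoeffding_sample_size(epsilon, delta):
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

def estimate_support(is_subgraph, cluster, epsilon, delta, rng):
    n = min(hoeffding_sample_size(epsilon, delta), len(cluster))
    sample = rng.sample(cluster, n)
    return sum(map(is_subgraph, sample)) / n

print(hoeffding_sample_size(0.05, 0.01))  # 1060 members suffice

# Toy stand-in for a cluster: the predicate is precomputed as booleans.
cluster = [True] * 80 + [False] * 20
print(estimate_support(lambda g: g, cluster, 0.1, 0.05, random.Random(1)))  # 0.8
```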

In comparison with structural clustering competitors, such as [14, 23, 53, 54, 58], StruClus raises the maximum dataset size by multiple orders of magnitude, reaching into the domain of large-scale de novo databases. At the same time, StruClus outperforms in terms of quality those structural competitors that still achieve a suitable performance for medium- to large-scale datasets. Figure 7 shows an extract of an in-depth evaluation given in [49 SPP] w.r.t. quality and performance on a real-world dataset (heterocycle) and a synthetic dataset. The heterocycle dataset consists of composed molecules classified by their reaction types. The synthetic dataset contains common subgraphs for each class of graphs and is used to perform analyses with varying parameters. In Sect. 5.3, we present a real-world use case of StruClus.

5 Rational Drug Design Applications

In this section, we present successful applications of the above approaches. Additionally, we present our tool Scaffold Hunter, which brings the scientific findings into the realm of practical drug design.

Fig. 7.
figure 7

StruClus evaluation in comparison with SCAP [54], Proclus [2], and Kernel K-Means [16]. Graphlet (i.e., small induced subgraph) frequencies are used for Proclus and Kernel K-Means.

5.1 Scaffold Hunter

Scaffold Hunter [26, 48 SPP] is open-source software for the analysis and visualization of molecular data that aims to support the user in elucidating structure-activity relationships. To this end, it features several structural classification schemes with dedicated visualizations and techniques to indicate chemical properties such as biological activity, e.g., by mapping values to colors, cf. Fig. 8. A fundamental structure-based concept relies on common core structures, so-called scaffolds, which can be organized hierarchically in a scaffold tree [52]. This approach forms the basis for several views, which show the scaffold tree in a radial layout, in the form of a tree map, or a set of scaffolds as a molecule cloud [13]. The latter view is inspired by the popular word cloud method, where the importance of words is indicated by their size. Here, scaffolds are scaled according to the number of molecules in the dataset containing them.

Following a different concept, structure-based hierarchical clustering is supported by means of chemical fingerprint similarity. Specifically for very large datasets, we have developed a heuristic method based on metric indexing [29]. The result can be visualized as a dendrogram that can be linked to a table or a heatmap. The heatmap visualizes property values in a matrix using color coding, where the columns are ordered in accordance with the dendrogram. This allows the user to identify whether chemical properties align with structural similarity.
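
The clustering step can be sketched in a few lines. The following is a naive illustration with made-up fingerprints, not Scaffold Hunter's heuristic (which additionally uses metric indexing to scale): fingerprints are modeled as sets of set-bit positions, compared by Tanimoto distance, and merged by single-linkage agglomeration.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity of two binary fingerprints (sets of set bits)."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 1.0)

def agglomerative(fps, n_clusters):
    """Naive single-linkage agglomerative clustering: repeatedly merge the
    closest pair of clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(fps))]
    while len(clusters) > n_clusters:
        x, y = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda p: min(tanimoto_distance(fps[i], fps[j])
                              for i in clusters[p[0]] for j in clusters[p[1]]))
        clusters[x] += clusters.pop(y)   # y > x, so index x stays valid
    return clusters

# Four toy fingerprints: two similar pairs.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8}]
print(agglomerative(fps, 2))
```

Recording the merge order instead of stopping at a fixed cluster count yields exactly the dendrogram that the heatmap columns are ordered by.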

Fig. 8.
figure 8

Scaffold Hunter visualizes molecular data in various linked views.

Several publications have shown that Scaffold Hunter is useful in various research tasks such as scaffold hopping, target prediction, chemical space analysis, and natural product simplification [7, 26, 33, 45, 21 SPP].

Fig. 9.
figure 9

Co-crystal structure of BRD4 in complex with one of the identified novel inhibitors (PDB ID 6g0e).

5.2 BRD4

In this study, an unexpected similarity in ligand binding between the bromodomain-containing protein 4 (BRD4) and the peroxisome proliferator-activated receptor gamma (PPAR\(\gamma \)) was identified. Both are important drug targets in cancer therapy, cardiovascular diseases, and inflammation processes [15, 24]. The starting point was a scaffold-focused analysis of bioactivity data using the command-line version of Scaffold Hunter [48 SPP]. This analysis revealed a bicyclic scaffold that can be found, amongst others, in known ligands for BRD4 and PPAR\(\gamma \). Compounds with similarity to known PPAR\(\gamma \) ligands were subsequently selected and tested on BRD4. Interestingly, the hit rate, i.e., the fraction of compounds active on BRD4, was unexpectedly high. Some of the novel inhibitors were successfully co-crystallized; one example is shown in Fig. 9. Further analyses of both proteins support the discovery of an unexpected relationship between the two drug targets [21 SPP], as they also show a high similarity of their binding sites. Based on this result, it seems possible to develop a drug that modulates both proteins with synergistic effects. Such a dual modulator could have implications for the prevention or treatment of resistances against BRD4 inhibitors, which have already been observed [42]. Thus, this study demonstrates the successful application of a graph-based method in a prospective drug discovery study.

5.3 Chipmunk

CHIPMUNK (CHemically feasible In silico Public Molecular UNiverse Knowledge base) [22 SPP] is a novel virtual library of small molecules that are synthesizable from purchasable reactants. The goal of such de novo libraries is the expansion of the known chemical and bioactivity space in order to enable virtual analytical processes to extract meaningful novel molecular structures, e.g., for drug discovery. The in silico simulated reactions are chosen such that their products can, with a high probability, also be synthesized in reality. Altogether, CHIPMUNK covers over 95 million compounds.
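
The enumeration principle behind such a library can be illustrated with a toy sketch. The building blocks and the string-level "reaction" below are hypothetical and are not CHIPMUNK's actual reaction machinery, which operates on molecular graphs with chemical feasibility checks.

```python
from itertools import product

# Hypothetical purchasable building blocks (SMILES strings).
acids  = ["CC(=O)O", "c1ccccc1C(=O)O"]   # carboxylic acids
amines = ["NC", "NCC", "Nc1ccccc1"]      # primary amines

def amide_coupling(acid, amine):
    """Toy amide bond formation on SMILES strings: drop the acid's terminal
    hydroxyl 'O' and attach the amine's substituent via a nitrogen."""
    assert acid.endswith("O")
    return acid[:-1] + "N" + amine[1:]

# Cross every acid with every amine, as a simulated reaction would.
library = [amide_coupling(a, m) for a, m in product(acids, amines)]
print(len(library))   # 2 acids x 3 amines = 6 virtual products
```

Applying many such reaction templates to large pools of purchasable reactants is what lets a de novo library grow combinatorially into the tens of millions of compounds.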

Fig. 10.
figure 10

Per-cluster database distribution for the novelty analysis of CHIPMUNK. The green share is the MCR-CHIPMUNK sublibrary, blue is ChEMBL, and red are commercially available compounds. The plot shows that some clusters are (almost) exclusively covered by CHIPMUNK. [Taken from [22 SPP], reprinted with permission from Wiley.] (Color figure online)

In the evaluation of CHIPMUNK, it was shown that the content of the library has interesting chemical properties and that the library covers previously undiscovered regions of the chemical and bioactivity space. The former aspect was analyzed using descriptor-based methods. It revealed that CHIPMUNK covers the physicochemical space of protein modulators and protein-protein interaction modulators well. StruClus (cf. Sect. 4.2) was used for the evaluation of the latter aspect, the novelty analysis. Additionally, StruClus itself was evaluated to show that it creates useful clusterings w.r.t. chemical properties (refer to [22 SPP] for further details). Thus, molecules of the same cluster exhibit similar chemical and biological properties with a high probability.

To analyze the novelty of CHIPMUNK, several libraries of commercially available compounds (ZINC [55], MolPort, and eMolecules) as well as the large-scale ChEMBL [4] bioactivity database were clustered in conjunction with CHIPMUNK. The former libraries serve as the known chemical space, whereas the latter serves as the known bioactivity space. The clustering revealed a large portion of clusters consisting purely of CHIPMUNK compounds (cf. Fig. 10 for an example).

Thus, it was shown that CHIPMUNK encompasses regions that are uncovered by existing databases yet exhibit physicochemical properties typical of protein modulators or protein-protein interaction modulators. It can be concluded that CHIPMUNK has the potential to contain future drugs.

The CHIPMUNK library is publicly available together with the clustering results. Areas of the chemical space (i.e., clusters) that overlap with the ChEMBL library can be used to relate novel molecules given in CHIPMUNK to already known molecules from ChEMBL in terms of structural similarity. This is helpful for transferring already existing knowledge to the CHIPMUNK library and may thus support the identification of biological targets for the CHIPMUNK compounds.

6 Conclusion and Outlook

Graph-based methods for the analysis of molecular data sets are particularly appealing because they can reveal subtle structural differences and allow interpretation in terms of substructures. The complexity of the related graph-theoretical problems, however, makes their application to large data sets challenging. We have developed new methods based on common substructures, which take the specific constraints in cheminformatics into account and exploit the properties of molecular graphs. Thereby, our techniques become efficient in both theory and practice. The application to molecular similarity search shows that our approach produces chemically meaningful rankings of molecules. Thus, it is well suited for virtual screening in large molecular databases. Moreover, we have developed a structural clustering algorithm, which represents clusters by common substructures and scales to very large databases with millions of molecules. Our methods have proven useful in various research tasks in rational drug design. The success of our approaches was also recognized in 2018, when we were invited to provide the cover feature of the June issue of ChemMedChem (cf. Fig. 11).

Fig. 11.
figure 11

The cover feature shows three chipmunks involved in the creation, analysis, and clustering of the synthesizable virtual molecule library CHIPMUNK. Nearly 100 million compounds were generated with in silico reactions on accessible building blocks, and their descriptor profile was analyzed. [Taken from [22 SPP], reprinted with permission from Wiley.]

At the time of writing this survey, our project is still ongoing. We are currently developing a distributed algorithm to mine representative sets of subgraphs for a variety of use cases, including but not limited to the development of a fully distributed structural clustering algorithm. For this, the discussions and results within the SPP have been very useful (cf. Chap. 14).

Within our project, we have also developed other approaches to algorithmic data analysis. For example, we have studied Graph Neural Networks (GNNs) and their use in generating molecular representations for application in virtual screening approaches. Here, GNNs performed worse than fingerprint-based multilayer perceptrons, which questions the use of simple GNNs to obtain molecular representations [36 SPP, 37 SPP]. Future work will show whether more complex graph-based representations will be able to replace molecular fingerprints as suitable input. For these learning approaches, it will be helpful to also learn with large generated graph families (cf. Chap. 2 and Chap. 3). Together with Christian Schulz, we investigate the applicability of kernelization (cf. Chap. 5), i.e., the iterative reduction of the problem to smaller instances, to common subgraph problems in large graphs. Matching problems provide a connection to Chap. 13, which is also concerned with life science applications. Jointly, we have worked on new streaming algorithms approximating the bipartite matching problem.