1 Introduction

Ideally, drug development commences with the knowledge that disrupting the functioning of a specific protein or enzyme will yield some positive medicinal effect. It is known, for example, that the enzyme plasmepsin II is involved in the degradation of haemoglobin in the parasite Plasmodium falciparum. This parasite is the major cause of malaria, one of the world’s worst diseases (there are over 200 million cases of malaria worldwide; WHO 2014). Inhibiting plasmepsin II disrupts the degradation of haemoglobin and kills the parasite (Silva et al. 1996), and consequently halts the disease. Plasmepsin II thus constitutes the primary target for anti-malarial drug development. Besides plasmepsin II, multiple proteins belonging to the same protein family, the aspartic proteases (plasmepsin I, IV and V), are crucial for survival of the parasite in human red blood cells (Berry 1997; Wyatt and Berry 2002; Sedwick 2014). These proteins constitute a group of potential targets for drug design. In such cases, a type of drug is needed that can act against these multiple targets.

Fig. 1

A simplified view of the chemistry involved in the drug-design process. Ligands are small molecules that can bind to the active site (this is usually found by empirical testing), and leads are small molecules that interact—usually sub-optimally—with the target. Leads usually have properties that could be optimised to yield potential drugs. Previously, ILP has been used to assist ligand-based identification of leads by constructing pharmacophores (3-d constraints on the location of some functional groups on the ligand). The focus of this paper is to use 3-d information about the target to identify new drug-like molecules

The chemistry of drug-development is a cyclic process (see Fig. 1) that attempts to find potential drugs by identifying small molecules that act as “keys” that fit into protein “locks” (in the example just above, plasmepsin II constitutes the lock, and a molecule that inhibits its activity is a key). Depending on whether the structure of the lock or key (or at least of their important parts) is known, there are two different routes to drug-design. If important structural features needed for a key (a potential drug) are known, then these features can be used to search databases of known compounds (Footnote 1). On the other hand, if we know the structure of the target (the lock), then we are able to design new molecules directly to complement the target’s structure. This second approach is commonly referred to as de novo design of drugs.

Of the two approaches, de novo design is the more direct and attractive. According to the sc-PDB database, out of about 100,000 structures listed in the Protein Data Bank (PDB), only about 3500 unique proteins have the ability to bind with high affinity to a known set of drug compounds (Desaphy et al. 2014). Moreover, infectious diseases like malaria and tuberculosis are known to continually develop resistance against traditional compounds (Wongsrichanalai et al. 2010), leaving progressively less scope for using similar drug compounds as leads. Any de novo approach that does not require prior knowledge of inhibitors and directly uses the three-dimensional structure of a target protein (the receptor) to design molecules that could bind to the receptor is therefore of significant interest. This paper is concerned with the use of Inductive Logic Programming to assist in the development of such an approach (Footnote 2).

We will rely on deducing properties of the active site using molecular interaction fields (MIFs) of targets with small organic molecules called probes. An MIF, ideally, is a continuous surface that, at each point in three-dimensional space, gives the interaction energy of a target with the probe. In the method proposed here, given a target, (discretised approximations of) the MIFs of the target with a number of probes are obtained. Our goal will be to identify favourable interaction patterns from the MIFs of one or more probes that occur repeatedly in a series of related targets. We pose this as a problem of identifying frequent cliques in a complete graph in which vertices are MIF interaction points and edges are labelled with distances between points. The occurrence of a clique in the MIFs of more than one target requires all edge-distances to be checked with some tolerance. In this paper, we will identify such frequently occurring “elastic” cliques using an ILP engine. The cliques identified specify constraints on high-energy interaction points for the target and each probe. This allows us to do two things. First, we are able to conjecture residue-level information about the target site. Second, we are able to use the constraints to specify a pharmacophore-like constraint that can be used either to assist in the construction of new molecules, or to search existing databases. In fact, it will generally be insufficient to look simply at high-energy interactions of probes with targets, and we also incorporate some well-understood domain-specific constraints that follow from geometric and pharmacological requirements of the target site. These can be naturally incorporated as inputs to the ILP engine. We first summarise the principal contributions of the paper. To the chemistry of drug-design, this paper’s contributions are as follows:

  • This is the first method that can perform multi-target, multi-probe drug design using molecular interaction fields, at the same time incorporating target-specific and probe-specific constraints in a flexible manner. The multi-target approach results in a consensus active-site approach that cannot be achieved with a single-target approach.

  • To the best of our knowledge, this is the first report of pharmacophores to aid antimalarial research, obtained using a series of related aspartic proteases. The screens show that we are able to identify known inhibitors with good precision and recall.

  • There are many ligand-based approaches in the field of drug design. These require prior information about inhibitors, but there are very few receptor-based methods that use the structure of proteins to generate pharmacophores. Even those that do often require the protein structure along with a bound ligand (co-crystallised, or docked using a computer simulation). Here we construct pharmacophores without the need for inhibitors. This is of special relevance with the emergence of drug-resistant parasites for diseases like malaria. In such cases, the use of historical databases of inhibitors is of limited value.

Based on the long experience of one of the authors in the pharmaceutical industry (Footnote 3), and a recent publication in the chemical literature (Kaalia et al. 2015), we have good reasons to believe that these are significant contributions to the area of drug-design. To the field of ILP, this paper’s contributions are as follows:

  • The paper constitutes an application of ILP to a real-world problem of significant scientific and industrial interest. It also moves forward significantly the use of ILP in drug-design from a ligand-based approach to a receptor-based approach. This is the first demonstration of the use of ILP for mapping the active-site in receptor-based drug design.

  • We demonstrate on a real problem, the principal feature of an ILP-based approach that differentiates it from many other forms of machine-learning, namely: the incorporation of diverse aspects of domain-expertise as background (prior) knowledge. Our results also show performance and computational benefits of incorporating domain constraints when conducting a resource-bounded search for solutions.

  • On a more technical note, the application also demonstrates the use of an explicitly defined “refinement operator” in the search to incrementally extend frequent cliques to larger ones. To the best of our knowledge, no ILP applications have employed refinement operators in this manner.

The work presented here is a substantial extension of the work in Kaalia et al. (2015). The principal differences are these: (a) Here, the focus is on the use of ILP for the problem. To this end, we have included algorithmic descriptions of the procedures used to find frequent cliques; (b) In Kaalia et al. (2015), only maximally-specific pharmacophores are considered. Here we provide a method of extending this to more general pharmacophores using the notion of quasi-cliques; (c) The assessments in Kaalia et al. (2015) are largely of a qualitative nature, focusing on the chemistry implied by the results. This paper contains quantitative assessments of results in a manner familiar to researchers in machine learning; (d) We have included comparisons against a Baseline variant that allows us to assess the role of domain-knowledge; and (e) We have created an additional set of decoys and present results on this set, which allow us to compare against a random choice approach.

The rest of the paper is organised as follows. In Sect. 2 we describe the use of molecular interaction fields (MIFs) to characterise targets. High-energy points in the MIFs from multiple probes are treated as vertices in a graph. Section 2.1 describes the identification of frequent cliques in such graphs. We would like the cliques identified to be meaningful for drug design. The use of an ILP engine provided with domain-specific constraints to find meaningful cliques in MIF-graphs is described in Sect. 3. Deriving pharmacophores using the cliques found is described in Sect. 3.4. The application of the approach to the discovery of antimalarials is in Sect. 4. Related work is in Sect. 5, and Sect. 6 concludes the paper. The Appendices contain some relevant graph-theoretic terms and concepts, and a brief description of the chemical rationale underlying the domain-constraints used in the paper.

2 Drug design using cliques from MIF surfaces

A molecular interaction field, or MIF, denotes the potential energy variations arising from the interaction of a target molecule with an atom or small group of atoms called a probe. The steps involved in computing MIFs are described in detail in Cruciani (2006). The main ideas are these: (1) the target is usually taken to be a rigid structure, or to be in one of a small number of alternative shapes or conformations; (2) the atomic coordinates (in three dimensions) of the target are known, either experimentally, or theoretically from energy-minimisation simulations; (3) MIF values are computed at points on a rectilinear grid, within which the target is placed; (4) the interaction energy with the probe is computed at each grid point. For each point, this is the amount of work required to bring the probe from infinity to that point. If the probe is a unit positive charge, then the reader will recognise that the MIF is just the electrostatic potential of the target at that point. The energy calculations consist of four different components, accounting for Van der Waals interactions, electrostatic interactions, directional interactions due to hydrogen bonds, and interactions arising from displacement of water molecules. The details are unimportant here, but broadly speaking, the net result is a set of grid positions where the target interacts favourably with the probe, and those where it does not.

Fig. 2

A 1-d example of using MIFs to characterise energy interactions between targets and probes. a The energy variation at pre-specified x-values (“grid points”) between a (hypothetical) target and two probes (the energy of interaction from one probe is shown here in red and the other in black). The red circles are the high-energy points for the red-probe, and the black circles are the high-energy points for the black probe. What constitutes a “high-energy” point is determined by probe-specific thresholds on energy-levels, shown by dotted horizontal lines; b the high-energy points A–G as vertices in a complete graph. The edge-labels are distances between the points; c a subgraph of the high-energy interaction graph. Since the graph in b is complete, it is a clique, and any subgraph of it is necessarily also a clique; d topologically equivalent cliques, arising from stretching or contracting distances (Color figure online)

A 1-dimensional example is shown in Fig. 2a–d [from Kaalia et al. (2015): in reality, the MIF is a three-dimensional surface]. In Fig. 2a, we focus on points of high-energy, denoting favourable interactions with two probes. These are shown here as the grid points A–G, obtained using probe-specific thresholds on the energy of interaction. The energy levels corresponding to the threshold are shown as broken lines: grid points with interaction energy above the threshold (A–C for one of the probes, and D,E,F,G for the other) are the ones of interest. Figure 2b is a graph-based representation of how grid-points with high-interaction energy are spatially related to each other. In the graph, vertices denote points of high interaction energy and edges are labelled by the (Euclidean) distance between the points. This graph is a complete graph, or a clique (that is, every pair of vertices has an edge between them). Given a set of related targets, we are interested in finding the largest such clique that is common to all (or most) of them. Thus, what we want is to identify maximal cliques that are most frequent, or “frequent maximal-cliques” for short. Some graph-theoretic terms and concepts of relevance to this paper are reproduced in “Appendix 1”.
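As an illustration only, the construction in Fig. 2a–b can be sketched in Python (the ILP system described later uses Prolog, which is not reproduced here). The data layout, the function names `high_energy_points` and `mif_graph`, and all numbers are assumptions made for this sketch; the thresholding follows the convention in the text that points above a probe-specific threshold are retained.

```python
import itertools
import math

def high_energy_points(mif, thresholds):
    """Return (probe, xyz) pairs whose energy exceeds the probe's threshold.

    mif: dict mapping probe name -> list of (xyz, energy) grid entries
         (a hypothetical layout chosen for this sketch).
    thresholds: dict mapping probe name -> probe-specific energy threshold.
    """
    return [(probe, xyz)
            for probe, grid in mif.items()
            for xyz, energy in grid
            if energy > thresholds[probe]]

def mif_graph(points):
    """Complete graph: every pair of points gets a distance-labelled edge."""
    edges = {}
    for i, j in itertools.combinations(range(len(points)), 2):
        edges[(i, j)] = math.dist(points[i][1], points[j][1])
    return edges

# Toy 1-d data loosely modelled on Fig. 2a; every number is invented.
mif = {
    "red":   [((0.0,), 2.5), ((1.0,), 0.3), ((3.0,), 2.1)],
    "black": [((2.0,), 1.8), ((4.0,), 0.2)],
}
thresholds = {"red": 2.0, "black": 1.5}
points = high_energy_points(mif, thresholds)
graph = mif_graph(points)
```

With three retained points the graph has three edges, one per pair, which is what makes it a clique in the sense used above.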

Although related, no two targets would be identical, which gives rise to the following difficulties:

  1. A maximal clique from any one target (such as the one in Fig. 2b) would not necessarily occur in the MIF of a different target. We will therefore often be looking for common subgraphs across target MIFs (like Fig. 2c). However, not all common subgraphs may be useful for drug-design.

  2. Equivalent vertices may not be at exactly the same positions in the graphs obtained from different targets. Checking for the occurrence of subgraphs would therefore have to employ notions of topological equivalence that allow for shrinking and stretching. However, to prevent completely dissimilar cliques from being judged as equivalent (like the ones in Fig. 2d), we want to place some restrictions on the amount of such deformation.

We account for both these issues by stating our problem as follows. Given a series of targets, we want to find the largest, most frequently occurring cliques that satisfy certain topological and domain-specific constraints. The former is to account for variations in targets and the latter is to allow meaningful drug design.

2.1 Discovering frequent MIF cliques

The clique detection problem normally refers to the task of determining whether a graph has a subset of vertices such that every pair of vertices in the subset is connected by an edge (that is, the subset represents a complete subgraph of the original graph). The problem extends naturally to multiple graphs, with the task of determining whether the same clique occurs in all (or many) of the graphs. A straightforward approach for MIF-graphs, based on the Apriori algorithm (Agrawal and Srikant 1994), is shown in Algorithm 1.

figure a

Here are some observations related to this procedure:

Pruning:

For frequent clique finding to be effective we need: (a) the generation of cliques to be efficient; and (b) efficient testing to see if a clique occurs in a set of graphs. Concerning (a), since all the MIF-graphs are complete graphs, all vertices in G are connected. So, in the worst case, when looking for cliques of size k in a graph with N vertices, \(O(N^{k})\) subgraphs will have to be examined. The procedure in Algorithm 1 exploits the Downward Closure Property (see “Appendix 1”) to examine possibly fewer cliques than the worst case. Concerning (b), the general problem of finding whether a subgraph is contained in a supergraph is the subgraph isomorphism problem, which is computationally hard (Michael and David 1979). We have some advantage in that our vertices are points in a metric space, and edges are labelled with distances. We will however require a form of error-tolerant matching, as described next.

Approximate matching:

In practice, when judging the occurrence of a clique (subgraph) in a MIF-graph (the supergraph), we will not be able to obtain a perfect match of a grid point in the MIF with the position of a vertex in the clique. Instead, we require the grid point to be within some \(\epsilon \)-radius of the position of the vertex in the clique. The matter is thus somewhat like finding the occurrence of a subgraph in which the vertices are connected by elastic bands (or springs). In this paper, we will employ a logical encoding that will allow a standard logic-programming theorem prover to decide whether a clique occurs in a graph. The backtracking-based approach used by logic-programming systems may not always be the most efficient, but it suits the use of ILP (described below). Further, although we require cliques to occur in the MIF-graphs of all targets, this can be easily changed to cliques occurring in some minimum number of targets (a “support” threshold in the literature of frequent pattern-mining).
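The elastic matching just described can be illustrated with a small backtracking test in Python. This is a sketch under stated assumptions, not the logical encoding used in the paper: the function name `occurs`, the data layout, and the choice to apply the tolerance to every inter-vertex distance (rather than to vertex positions) are all assumptions made here for illustration.

```python
import math

def occurs(clique_labels, clique_dist, points, eps):
    """Backtracking test: does the clique occur among `points` within eps?

    clique_labels: probe label of each clique vertex, in a fixed order.
    clique_dist: dict {(i, j): distance} for i < j over clique vertices.
    points: list of (probe, xyz) high-energy points of one target.
    eps: tolerance allowed on every inter-vertex distance.
    """
    def extend(assignment):
        k = len(assignment)
        if k == len(clique_labels):
            return True  # every clique vertex has been placed
        for idx, (probe, xyz) in enumerate(points):
            if probe != clique_labels[k] or idx in assignment:
                continue
            # The new vertex must agree with all previously placed vertices.
            if all(abs(math.dist(xyz, points[prev][1]) - clique_dist[(i, k)]) <= eps
                   for i, prev in enumerate(assignment)):
                if extend(assignment + [idx]):
                    return True
        return False  # no consistent placement: backtrack

    return extend([])

# Invented coordinates (in Angstrom) for a three-vertex example.
target = [("hydroxyl", (0, 0, 0)), ("carbonyl", (3, 0, 0)), ("amide", (0, 4, 0))]
labels = ["hydroxyl", "carbonyl", "amide"]
dists = {(0, 1): 3.2, (0, 2): 4.1, (1, 2): 5.0}
```

Under these invented values, `occurs(labels, dists, target, 0.5)` succeeds while a tight tolerance such as `0.05` fails, which is the behaviour the elastic-band analogy suggests.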

Incremental search:

The function \(select\_graph\) returns one of the MIF-graphs. If the clique sought has to occur in all target MIFs, then it does not matter, conceptually speaking, which of the target MIF-graphs is selected. However, some MIF-graphs may be smaller than others, and this can make a computational difference. The function \(extend\_cliques\) extends existing frequent cliques in \({H}_{k-1}\) by adding a vertex, and checks that the resulting clique occurs in the remaining \(T-\{G\}\) MIFs (in the approximate sense just described).
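A level-wise search of this kind can be sketched in Python as follows. Since Algorithm 1 is not reproduced in this text, the sketch below is only one plausible reading of it: the bodies of `select_graph`, `extend_cliques` and `mif_cliques`, the helper `occurs`, the choice of the smallest graph, and the toy coordinates are all assumptions made for illustration.

```python
import math

def occurs(template, points, eps):
    """True if the labelled point set `template` can be matched in `points`,
    with every pairwise distance reproduced to within eps (backtracking)."""
    def extend(chosen):
        k = len(chosen)
        if k == len(template):
            return True
        label, xyz = template[k]
        for cand in points:
            if cand[0] != label or cand in chosen:
                continue
            if all(abs(math.dist(cand[1], prev[1])
                       - math.dist(xyz, template[i][1])) <= eps
                   for i, prev in enumerate(chosen)):
                if extend(chosen + [cand]):
                    return True
        return False
    return extend([])

def select_graph(targets):
    """Assumed heuristic: pick the smallest MIF-graph as the vertex source."""
    return min(targets, key=len)

def extend_cliques(frequent, G, others, eps):
    """Add one vertex of G to each frequent clique; keep the extensions
    that occur (approximately) in every other target."""
    result = set()
    for clique in frequent:
        for v in range(len(G)):
            if v in clique:
                continue
            cand = tuple(sorted(clique + (v,)))
            if cand in result:
                continue
            template = [G[i] for i in cand]
            if all(occurs(template, T, eps) for T in others):
                result.add(cand)
    return result

def mif_cliques(targets, k_max, eps):
    G = select_graph(targets)
    others = [T for T in targets if T is not G]
    level = {()}  # the empty clique is trivially frequent
    found = {}
    for k in range(1, k_max + 1):
        level = extend_cliques(level, G, others, eps)
        if not level:
            break  # downward closure: no larger frequent clique exists
        found[k] = level
    return found

# Two invented targets; vertices are (probe, xyz) pairs.
targets = [
    [("hydroxyl", (0, 0, 0)), ("carbonyl", (3, 0, 0)),
     ("amide", (0, 4, 0)), ("carbonyl", (9, 9, 9))],
    [("hydroxyl", (1, 1, 1)), ("carbonyl", (4.1, 1, 1)), ("amide", (1, 5.1, 1))],
]
found = mif_cliques(targets, k_max=3, eps=0.5)
```

Cliques are stored as sorted tuples of vertex indices into the selected graph, so the same vertex set reached by different extension orders is tested only once.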

There are several frequent subgraph mining (FSM) methods that have been developed based on the Apriori approach that are essentially no different to the procedure \(mif\_cliques\) (see Jiang et al. 2013). In principle, any of these could be used to find frequent cliques in graphical representations of MIFs (with some modifications restricting subgraphs to cliques, and isomorphism checking to handle distance-based edge labels, as described). There are some practical difficulties however. The MIF from each probe can have many points of high-energy interaction, and the problem is compounded further by the use of multiple probes. As a result, MIF-based graphs can have several hundred vertices, representing high-energy interaction points with probes. Even if the size of the clique we seek is fixed to some k, since the graphs are complete there is no straightforward algorithmic reduction in complexity possible. Some computational gains may be possible however with the use of probe, target and clique-specific constraints. Examples of the constraints we use in this paper are shown in Table 1.

Table 1 Examples of constraints used to reduce the complexity of clique-finding

Constraints like these can reduce the complexity of clique-finding, provided we have an approach that uses them directly when constructing acceptable cliques. It is possible to develop a special-purpose FSM that has these properties: for example, a specialised form of \(extend\_cliques\) in Algorithm 1 can be used that includes such constraints. The difficulty is that this requires re-programming the function each time the domain-constraints are altered.

In this paper, we use an Inductive Logic Programming (ILP) approach to implement frequent clique finding with domain-constraints. Our principal motivation to do so is that ILP engines are general-purpose programs that can be specialised by the inclusion of domain-specific background knowledge. We believe this provides a scalable approach to the problem of inclusion of domain-knowledge into the basic clique finder in Algorithm 1.

3 Domain-specific MIF clique finding with ILP

We do not describe general details of an ILP system here: the reader is referred to a good general survey like Muggleton (1994), and to any good text on logic programming for representations of the statements here. For our purposes, it is sufficient to know that although we will be presenting logical statements in a form of English, the ILP system uses the Prolog language to represent the statements.

3.1 Representation

We start with MIF-graphs. The MIF interaction of multiple probes with a target will be represented by a logical statement of the form shown in the example below.

Example 1

Clausal representation of a MIF-graph:

figure b

In this paper, this will be the representation adopted for each of the T MIF-graphs. Readers familiar with the ILP literature on inverse entailment (Muggleton 1995) will recognise this as the most-specific clause given: (a) a target protein; (b) its MIFs with multiple probes; (c) background knowledge for determining peaks in MIFs; and (d) background knowledge allowing the computation of the (Euclidean) distance between pairs of points. A moment’s reflection will convince the reader that this clause encodes the complete graph of the kind shown in Fig. 2b, generalised to three dimensions (jumping ahead, it is the clausal representation of the set \(\bot \) in Algorithm 3). The distances \(d_{i}\) and tolerances \(\epsilon _{i}\) are numbers, and any clique of interest is a logical statement that can be obtained from the most-specific statement. For example, a clique of the kind in Fig. 2c could be represented by a clause.

Example 2

Clausal representation of a clique:

figure c

It is evident that this represents a clique with 4 vertices. In ILP terms, this is a clause that logically entails the most-specific clause, and it is the task of the ILP system to find such clauses, usually by employing some form of combinatorial search (like a branch-and-bound search).

3.2 Constraints

The computational complexity of the task should now be evident: given a most specific clause encoding a MIF graph with N vertices, the ILP engine is looking to choose k vertices from the N, and obtain values for the corresponding pairwise distances and tolerances such that the resulting logical statement is true for all or most targets provided. In the worst-case, for an ILP engine employing a branch-and-bound search, this is \(O(N^{k}k^{2})\). By inclusion of domain-constraints, we can try to lower N, remove some edges in the graph, and avoid exploring all of the search space. For an ILP engine, these constraints form part of the background knowledge B that is provided as input to the system. We describe the constraints provided for experiments in this paper (“Appendix 2” describes the chemical intuition underlying some of these constraints).
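To give a feel for the size of this search space, the short Python sketch below counts the k-vertex subsets of an N-vertex complete graph and the pairwise distances each subset carries. It is a back-of-the-envelope count under the assumption of one distance check per vertex pair; the function name `candidate_checks` and the values N = 300 and k = 4 are illustrative choices, not figures from the experiments.

```python
import math

def candidate_checks(n_vertices, k):
    """Return (number of k-vertex subsets, total pairwise distance checks)."""
    subsets = math.comb(n_vertices, k)   # at most N^k / k!
    edges_per_subset = k * (k - 1) // 2  # grows as k^2
    return subsets, subsets * edges_per_subset

subsets, checks = candidate_checks(300, 4)
```

For a graph with 300 vertices and cliques of size 4 this gives roughly 3.3 × 10^8 candidate subsets and about 2 × 10^9 distance checks per target, which is why lowering N and removing edges through constraints matters.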

3.2.1 Probe-specific constraints

These relate to the probes being used:

  • Energy In Fig. 2a, thresholds on the energy levels are used to identify points of high-energy interaction between a target and a probe. Since each point corresponds to a vertex in the MIF graph, it is evident that changing the threshold will alter the number of vertices in the graph.

  • Occurrences All cliques have constraints on the minimum and maximum number of vertices from hydroxyl, amide, and carbonyl probes. These constraints depend on knowledge of the active-sites of the targets, and they restrict the number of possible cliques that can be examined.

3.2.2 Target-specific constraints

These relate to information about the targets:

  • Focus Although of less relevance if the cliques being sought occur in all MIF-graphs, it may be important to ensure that frequent cliques occur in the MIF-graph of at least one specific target protein. For example, for the case study in this paper, the target plasmepsin II is of special importance, and we want frequent cliques to occur in the MIF-graph of that protein.

  • Active-site size The vertices identified by the clique will eventually be translated into pharmacophoric constraints on potential ligands. These are usually small molecules. It makes little chemical sense therefore to identify cliques with large inter-vertex distance (the area covered by the active site may be exceeded). Ligands are also more flexible than their protein targets. To allow for some conformational movement, it is advisable not to have vertices too close to each other (Van der Waals forces may develop when the points are too close, which we would like to avoid). Together, these translate to minimum and maximum constraints on distance between vertices in the MIF-graph (Footnote 4). Edges that do not satisfy these constraints are removed from the MIF graph. The MIF-graphs are then no longer complete graphs (that is, not all vertices are connected to each other).

  • Anchor For the case study in this paper, we want to ensure that each clique has at least one hydroxyl vertex that is not far from an aspartic acid residue. This is because the application we consider involves aspartic proteases. In particular, we will be using the locations of aspartic acid residues in plasmepsin II.
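The active-site size and anchor constraints above can be illustrated in Python as simple filters on a MIF-graph. This is a hedged sketch: the function names `prune_edges` and `satisfies_anchor`, the parameter names, and all coordinates and bounds are invented for illustration and are not the values used in the experiments.

```python
import itertools
import math

def prune_edges(points, d_min, d_max):
    """Keep only edges whose length lies in [d_min, d_max].

    After pruning, the graph is in general no longer complete.
    """
    edges = {}
    for i, j in itertools.combinations(range(len(points)), 2):
        d = math.dist(points[i][1], points[j][1])
        if d_min <= d <= d_max:
            edges[(i, j)] = d
    return edges

def satisfies_anchor(clique, points, asp_positions, max_anchor_dist):
    """True if some hydroxyl vertex of the clique lies within
    max_anchor_dist of at least one aspartic acid residue position."""
    return any(
        points[v][0] == "hydroxyl"
        and any(math.dist(points[v][1], asp) <= max_anchor_dist
                for asp in asp_positions)
        for v in clique
    )

# Invented points (probe, xyz) and a single invented residue position.
points = [("hydroxyl", (0, 0, 0)), ("carbonyl", (3, 0, 0)),
          ("amide", (0, 20, 0)), ("hydroxyl", (1, 0, 0))]
edges = prune_edges(points, d_min=2.0, d_max=10.0)
asp_positions = [(0, 0, 1)]
```

Here the far-away amide vertex loses all of its edges, so it can no longer take part in any clique with more than one vertex.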

3.2.3 Generic constraints

The following constraints are generic to the clique-finder:

  • Tolerances Without specific values for inter-vertex distances (the d’s in the examples earlier), it is not meaningful to test whether a clique does or does not occur in the MIF-graph of a target. Also, since not all targets interact in the same way with probes, a clique rarely occurs in precisely the same way in all targets. So unless tolerances are allowed on inter-vertex distances (the \(\pm \epsilon \)’s in the examples earlier), almost all cliques will be infrequent. Distances and tolerances have to be estimated from target MIFs. Only those distances that satisfy the distance constraints described are of interest. We would expect the tolerance value to be small (no more than an Angstrom or so), in order that the design of ligands is unaffected.

  • Clique-size We have already stated the worst-case complexity of finding a k-vertex clique in a graph with N nodes (\(O(N^{k}k^{2})\)). We therefore bound k to some fixed value.

  • Number of cliques For large graphs, bounding the exponent k does not always reduce the complexity to manageable levels. For tractability, it will often be necessary to bound the total number of cliques examined. This clearly sacrifices completeness of the clique-finder.

Broadly speaking, ILP engines allow constraints of these kinds to be encoded in one of two main ways: either as definitions in the background knowledge that control the specification of the search-space; or as the definition of statements that control the enumeration of elements in the search-space. We illustrate each of these in turn.

First, constraints can be encoded directly as part of the background knowledge. For example, the locations of probe peaks can be computed by constraining MIF energies to be above some probe-specific energy thresholds.

Example 3

Computing high-energy peaks in MIF surfaces:

figure d

Distance-computations can also be defined in the background knowledge, and can be constrained to check ranges:

Example 4

Constrained distance computation:

figure e

Background knowledge definitions can also contain statements establishing values of parameters required.

Example 5

Parameter values:

figure f

The second way in which an ILP system allows the incorporation of constraints is in the form of specific kinds of statements that are checked during the ILP engine’s search for hypotheses. For example, here is a statement that checks if a clique (recall that this is encoded by the ILP engine as a first-order definite clause) fails the constraint of the distance from a hydroxyl peak to an ASP dyad location in plasmepsin II. If the check fails, then the clique is discarded (“pruned”):

Example 6

Search constraint:

figure g

The specific ILP engine we use also allows one additional way in which enumeration of the search-space can be controlled. This is in the form of explicit specification of “refinement operators” that allow the ILP engine to enumerate the search-space much more efficiently. ILP systems designed to construct and use a most-specific clause usually search the space by starting from a trivial statement (“All proteins are potential targets”) and progressively adding conditions, one at a time, from the most specific clause. This is not well-suited to the representation we have shown earlier, which requires condition-triples to be added for every pair of vertices in a clique (one condition for each of the vertices, and one for the pairwise distance between them; see Footnote 5). We use the facility of being able to specify a refinement operator to change this form of default behaviour. This allows us to enumerate complete cliques that satisfy constraints like these:

  1. All clauses encoding cliques of size n are refinements of clauses encoding frequent cliques of size \(n-1\) (if they exist); and

  2. Extending a clause encoding a clique of size n to a clique of size \(n+1\) involves adding all conditions from a most-specific clause \(\bot \) representing a new vertex and all pairwise distances of the new vertex to vertices in the existing clique.

Here is an example of a refinement operator that implements these conditions (\(\Box \) represents the empty clause):

Example 7

Clausal definition of a refinement operator:

figure h

(This can be generalised to add more than one vertex at-a-time, and more statements would be needed for boot-strapping the search, with no prior frequent clauses.)
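The two conditions above can be mimicked in Python by a generator that extends a clique by one vertex and collects the conditions that would be added to the clause. This is an illustrative analogue only, not the clausal operator of Example 7: the function name `refine`, the tuple encoding of literals, and the ordering rule used to avoid generating the same clique twice are all assumptions of this sketch.

```python
def refine(clique, bottom_vertices, bottom_distances):
    """Yield (new_clique, added_literals) for one-vertex extensions of `clique`.

    clique: tuple of vertex ids already in the clause, in increasing order.
    bottom_vertices: dict {vertex id: probe label}, taken from the
        most-specific clause.
    bottom_distances: dict {(u, v): distance} with u < v, also taken from
        the most-specific clause (edges removed by constraints are absent).
    """
    for v in sorted(bottom_vertices):
        # Only add vertices with a larger id, so each clique appears once.
        if clique and v <= clique[-1]:
            continue
        literals = [("peak", bottom_vertices[v], v)]
        complete = True
        for u in clique:
            if (u, v) not in bottom_distances:
                complete = False  # a required pairwise distance is missing
                break
            literals.append(("dist", u, v, bottom_distances[(u, v)]))
        if complete:
            yield clique + (v,), literals

# Invented most-specific clause content: three vertices, one edge missing.
vertices = {1: "hydroxyl", 2: "carbonyl", 3: "amide"}
distances = {(1, 2): 3.0, (1, 3): 4.0}
extensions = list(refine((1,), vertices, distances))
```

Extending a clique of size n adds one vertex condition and n distance conditions at once, rather than one condition at a time as in the default search described above.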

At this point the reader may be concerned that specifying such a refinement operator may be as hard as writing a special-purpose frequent clique finder. Writing refinement operators is indeed not a straightforward business. Fortunately, this has to be done once, and as is evident, is domain-independent. Variations in domain-specific constraints are expected from one problem to another. These remain somewhat easier to write and modify (Footnote 6). Of course, we do not have to specify a refinement operator for the ILP engine to function: their use here is purely for efficiency. The usual general-to-specific or specific-to-general operators used by most ILP engines should also yield the same results, with more computational (but less programming) effort.

3.3 Implementation

Without getting into the minutiae of the implementation, it is sufficient to think of the ILP engine as implementing the simplified procedure in Algorithm 2. In our case, we require that this procedure correctly returns frequent cliques of some size k (we cannot guarantee completeness unless the resource-bound \(n_{max}\) is large enough). This will minimally require that the refine function is able to extend a clique of size k to a set of cliques, each of size \(k+1\). The ILP-based frequent clique-finder is then used in a straightforward manner to find frequent cliques of size at most \(k_{max}\) (see Algorithm 3), which is the requirement of the \(mif\_clique\) procedure in Algorithm 1 (Footnote 7).

figure i
figure j

The procedure described above is the MIF-graph version of the ILP-based clique finder for ligands described in Finn et al. (1998). We can also distinguish a special case of the procedure with \(B = \emptyset \). In this case we obtain an unconstrained frequent-clique finder in MIF-graphs. This can be seen as the MIF-graph equivalent of the frequent clique-finder for ligands described in Podolyan and Karypis (2009). This latter procedure will be used in experiments below as a baseline for comparing the performance of the domain-specific MIF-clique finder.

3.4 Pharmacophores from cliques

Any of the (frequent) cliques found by the ILP engine can be used to act as a template for a pharmacophore. Recall that a pharmacophore is a collection of chemical properties (called the pharmacophore’s features), such as hydrogen bond donor, hydrogen bond acceptor, electrostatic and hydrophobic interactions complementary to a target’s active site, along with three-dimensional constraints on distances between the features. There is a straightforward translation of the cliques obtained into pharmacophores. The vertices (carbonyl, hydroxyl and amide) are first translated into pharmacophore features of donors and acceptors (carbonyls map to acceptors, hydroxyls and amides to donors). The distance between the vertices translates to the distances between the corresponding donors and acceptors in the pharmacophore (Footnote 8).
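The translation just described is simple enough to sketch in a few lines of Python. The mapping of probes to features is taken from the paragraph above; the function name `to_pharmacophore`, the representation of a distance constraint as a lower and upper bound, and the example values are assumptions of this sketch.

```python
# Probe-to-feature mapping as stated in the text.
FEATURE_OF_PROBE = {"carbonyl": "acceptor", "hydroxyl": "donor", "amide": "donor"}

def to_pharmacophore(labels, distances, eps):
    """Translate a clique into pharmacophore features and distance ranges.

    labels: probe label of each clique vertex.
    distances: dict {(i, j): distance} between clique vertices.
    eps: tolerance applied to every distance (a single value is assumed).
    """
    features = [FEATURE_OF_PROBE[p] for p in labels]
    constraints = [(i, j, d - eps, d + eps)
                   for (i, j), d in sorted(distances.items())]
    return features, constraints

# Invented three-vertex clique.
features, constraints = to_pharmacophore(
    ["hydroxyl", "carbonyl", "amide"],
    {(0, 1): 3.0, (0, 2): 4.0, (1, 2): 5.0},
    eps=0.5,
)
```

Each constraint reads as "feature i and feature j must lie between the lower and upper bound apart", which is the form a database screen would typically consume.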

It is worth reiterating at this point that in ligand-based approaches, a pharmacophore is derived from the structure of one or more known inhibitors of the target. Here, cliques are obtained using the target’s interaction with the probes. The resulting pharmacophores are based on the structure of a family of receptors, which is an important conceptual difference. Ligand-based approaches are also subject to the “excluded-volume problem”. The problem stems from the fact that a ligand atom cannot occupy a space already taken up by a protein atom. This information is not usually available when generating pharmacophores purely based on ligand-structure. Here, since the MIF surfaces are not calculated for grid points lying inside the protein, the volume excluded by the target is automatically taken into account.

3.4.1 Generalising cliques

It is evident that a frequent clique yields a pharmacophore with maximal distance constraints on features. Pharmacophores with the same number of features, but fewer constraints, can be obtained as quasi-cliques from the subgraph lattice of a frequent clique (see “Appendix 1” for the meaning of these terms). An example of this lattice is in Fig. 3. With the clausal representation adopted here, this lattice is part of the clause subsumption lattice familiar to ILP practitioners (dropping an edge amounts to dropping a literal in the clausal representation of the graph). Adopting the convention used in the ILP literature, we will call pharmacophores derived from cliques at higher levels in the lattice more general than those from a lower level. In this paper, we will be looking at pharmacophores derived from frequent cliques (Level 0) and from some kinds of quasi-cliques at Level 1 only (the details are in Sect. 4.4).
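As a small illustration of Level 1 of the lattice, the sketch below enumerates the quasi-cliques obtained by dropping exactly one edge from a clique. The function names and the vertex layout (name, probe) are hypothetical; the optional `same_donor_type` filter encodes one possible reading of "chemically redundant", namely an edge joining two donors of the same probe type.

```python
from itertools import combinations


def level1_quasi_cliques(vertices, droppable=lambda u, v: True):
    """Return [(dropped edge, remaining edges)] for every edge that may be dropped."""
    edges = list(combinations(vertices, 2))
    return [
        (dropped, [e for e in edges if e != dropped])
        for dropped in edges
        if droppable(*dropped)
    ]


def same_donor_type(u, v):
    """Assumed redundancy test: both endpoints are amides (N) or both are hydroxyls (OH)."""
    return u[1] == v[1] and u[1] in {"N", "OH"}
```

For a 5-vertex clique there are 10 edges, and hence 10 quasi-cliques at Level 1 when no filter is applied.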

Fig. 3 The subgraph lattice obtained from a clique. The clique is from Fig. 2. Vertices in the lattice are restricted to connected subgraphs of the clique. Using the terminology in “Appendix 1”, each vertex is a quasi-clique. The term “almost clique” is used to denote a subgraph at Level 1. That is, these are the quasi-cliques resulting from dropping a single edge from the clique

We now describe an application of the approach to the design of antimalarials.

4 Case study: discovering antimalarials

We investigate the ILP-based method of determining multi-probe MIF-cliques using a series of proteins known to form targets for malaria. Specifically, we have structures of six proteins related to the receptor plasmepsin II; and MIF data from the interaction of the six proteins with hydroxyl, amide and carbonyl probes. Plasmepsin II is involved in the haemoglobin degradation pathway of Plasmodium falciparum, the parasite most commonly involved in malarial deaths: plasmepsin II is a known target for anti-malarial drugs against this parasite. Our aim is to discover frequent multi-probe MIF cliques using the ILP-based procedure just described; translate these into pharmacophore-like constraints on potential ligands; and evaluate these constraints quantitatively.

To the best of our knowledge, de novo design of this kind has not been attempted with this series of target proteins.

4.1 Data

4.1.1 Training

Training data are obtained from six protein targets for antimalarials (specifically, plasmodial aspartic proteases). These are: Cathepsin D, Pepsin, Plasmepsin Vivax, Plasmepsin I, Plasmepsin II and Plasmepsin IV. Crystal structures of five of these are known in advance. The structure of Plasmepsin I is based on a model. Three probes will be used: amide (N), carbonyl (O) and hydroxyl (OH).

The program computing the MIF surface usually employs a much finer-grained grid than is necessary for our purposes. We pre-process the data to obtain a coarser characterisation of the MIF surface by clustering together groups of grid-points that do not show significant variations in energy. The energy of such a cluster is taken to be the average energy of all points in the cluster, and the centre of the cluster is the coordinate-mean of the points in the cluster. A grid-point is considered significantly different from the points in an existing cluster if its energy is at least 2 standard deviations away from the average energy of the cluster.
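The two quantities attached to a cluster, and the 2-standard-deviation test, can be written down directly. The sketch below is illustrative: the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used or how clusters are grown.

```python
from statistics import mean, pstdev


def summarise_cluster(points):
    """points: [((x, y, z), energy), ...] -> (coordinate-mean centre, mean energy)."""
    coords = [xyz for xyz, _ in points]
    centre = tuple(mean(axis) for axis in zip(*coords))
    energy = mean(e for _, e in points)
    return centre, energy


def significantly_different(energy, cluster_energies, n_sd=2.0):
    """True if `energy` lies at least n_sd standard deviations from the cluster mean.
    Assumption: population standard deviation of the energies already in the cluster."""
    return abs(energy - mean(cluster_energies)) >= n_sd * pstdev(cluster_energies)
```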

The result is for each target and probe a set of (mean) grid-points, each associated with a (mean) energy value. The table below shows the numbers of points in the averaged MIF field for each of the targets and probes:

Target              Probe N   Probe O   Probe OH
Cathepsin           209       221       228
Pepsin              159       101       160
Plasmepsin Vivax    189       149       193
Plasmepsin I        228       232       258
Plasmepsin II       179       174       193
Plasmepsin IV       188       158       213

On average, the clustering reduces the number of grid points by a factor of 10 for each protein-probe combination. That is, the number of grid points before clustering is of the order of thousands. The target-specific MIF therefore consists of the mean MIF-values for each target-probe combination. For example, the coarse-grained MIF surface for Cathepsin consists of \(209+221+228=658\) points, that for Pepsin of 420 points, and so on. For any target, each point is a potential vertex in the MIF-graph for that target.

4.1.2 Testing

Our interest here is in two different kinds of populations of molecules. First, the antimalarial campaign by pharmaceutical companies like GlaxoSmithKline and Novartis has resulted in a large set of chemicals (ChEMBL: Gaulton et al. 2012) for which bioactivities against several plasmodial aspartic proteases are known. These are used to identify sets of positives and negatives. The set of positives is composed of 568 chemicals which are exclusive inhibitors of aspartic proteases (plasmepsin II, cathepsin D and pepsin); and the set of negatives contains 626 chemicals that do not target aspartic proteases. The negatives include non-protease inhibitors (393; mostly kinase, phosphodiesterase, phosphatase, reductase and cytochrome inhibitors) and protease inhibitors (233; but excluding inhibitors of aspartic proteases like plasmepsins, cathepsin D and pepsin). The details are below:

Targets         Positives   Negatives
PlasII          236         0
CathD           243         0
PlasII+CathD    89          0
Protease        0           233
Non-Protease    0           393
Total           568         626

The chemicals in these two subsets do not overlap, and their inhibition constants (Ki or IC50) are known. To use active inhibitors for screening by pharmacophores, all the inhibitors of the selected target proteins whose inhibition constant (Ki or IC50) is less than 50 \(\upmu \)M are selected. Finally, ligand flexibility is considered by generating conformers for each molecule. As usual, a pharmacophore “hits” a small molecule if it occurs in at least one conformer of the molecule.
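The notion of a “hit” can be made concrete with a naive matcher. The sketch is a toy under stated assumptions and is not the search procedure used in the experiments: a conformer is taken to be a list of (feature type, coordinates) pairs, a pharmacophore is a feature list with pairwise target distances, and the default tolerance of 0.7 Å is borrowed from the distance tolerance used for clique finding. The names `matches` and `hits` are hypothetical.

```python
from itertools import permutations
from math import dist


def matches(pharmacophore, conformer, tol):
    """True if some assignment of conformer atoms to pharmacophore features has the
    right feature types and all pairwise distances within `tol` of the targets."""
    features, distances = pharmacophore
    for assignment in permutations(range(len(conformer)), len(features)):
        if any(conformer[a][0] != f for a, f in zip(assignment, features)):
            continue
        if all(
            abs(dist(conformer[assignment[i]][1], conformer[assignment[j]][1]) - d) <= tol
            for (i, j), d in distances.items()
        ):
            return True
    return False


def hits(pharmacophore, conformers, tol=0.7):
    """A pharmacophore hits a molecule if it matches at least one of its conformers."""
    return any(matches(pharmacophore, c, tol) for c in conformers)
```

The exhaustive search over assignments is exponential in the number of features and is only practical for small pharmacophores such as the 5-feature ones considered here.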

Secondly, we are interested in a set of “decoy” compounds. These are small molecules that are physico-chemically similar to the aspartic protease inhibitors (positives). Decoy compounds are extracted from the ZINC database (http://zinc.docking.org/). The compounds are similar to the positives in their physico-chemical properties (like molecular weight, HBA, HBD, logP, number of rotatable bonds and so on: slight differences are permitted, like \(\pm 3\) for HBA and HBD; \(\pm 30\) Da for molecular weight and so on), but are structurally dissimilar (Tanimoto coefficient \(<\) 0.7). At least 20 decoys are found for every positive example, resulting in a total of 11,878 decoy compounds.Footnote 9
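The decoy criterion combines a property match with a structural-dissimilarity test, which the following sketch illustrates. It is not the DecoyFinder implementation: fingerprints are assumed to be sets of on-bit indices, the dictionary keys are hypothetical, and only the tolerances actually quoted in the text (HBA, HBD, molecular weight) are included.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


# Tolerances quoted in the text; those for logP, rotatable bonds etc. are not given.
TOLERANCE = {"hba": 3, "hbd": 3, "mw": 30.0}


def is_decoy_for(candidate, positive, max_similarity=0.7):
    """True if the candidate has similar properties but a dissimilar structure."""
    similar_properties = all(
        abs(candidate[key] - positive[key]) <= tol for key, tol in TOLERANCE.items()
    )
    dissimilar_structure = tanimoto(candidate["fp"], positive["fp"]) < max_similarity
    return similar_properties and dissimilar_structure
```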

In all cases, about 250 low-energy conformations are generated for each chemical in the positives, negatives and decoys to cover the space of possible conformations of the small molecules. In principle, a good pharmacophore model should be able to identify more of the positives and fewer of the negatives.

4.2 Background knowledge

The (coarse-grained) MIF-surface for a target is encoded using the predicate:

$$\begin{aligned} Has\_Energy(t,p,l,e) \end{aligned}$$

that is true if the (mean) interaction energy of target t with probe p at (cluster) location l (locations are specified by three-dimensional coordinates) is e. The principal predicates defined using this are:

$$\begin{aligned} Has\_Probe\_Peak(t,p,l) \end{aligned}$$

and

$$\begin{aligned} Dist(l_{1},l_{2},d,\epsilon ). \end{aligned}$$

The first predicate encodes the locations l at which the interaction energy of target t with probe p exceeds the lower bound on energy for the probe. The second predicate is true if the distance between locations \(l_{1}\) and \(l_{2}\) is \(d\pm \epsilon \). Both predicates incorporate domain-specific constraints on energy thresholds, distances and tolerances. For the ILP engine used here, these are communicated through user-defined parameters.
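For readers more familiar with procedural code, the two predicates can be mimicked as Boolean functions. This is only a sketch and not the background knowledge given to the ILP engine. The threshold values are the ones listed in Sect. 4.4, and the sign convention is an assumption: a peak is taken here to be an energy at least as negative as the probe's threshold, which the text does not state explicitly. The names `has_probe_peak`, `dist_pred` and the dictionary layout of `mif` are hypothetical.

```python
from math import dist

# Probe-specific energy thresholds from Sect. 4.4.
ENERGY_THRESHOLD = {"N": -3.5, "O": -2.2, "OH": -3.0}


def has_probe_peak(mif, target, probe, location):
    """mif maps (target, probe, location) -> mean energy, playing the role of Has_Energy.
    Assumption: a peak is an energy at or below (more negative than) the threshold."""
    energy = mif.get((target, probe, location))
    return energy is not None and energy <= ENERGY_THRESHOLD[probe]


def dist_pred(l1, l2, d, eps):
    """Counterpart of Dist(l1, l2, d, eps): the distance between l1 and l2 is d +/- eps."""
    return abs(dist(l1, l2) - d) <= eps
```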

As described earlier, the background knowledge also includes definitions of conditions under which a clause (or clique) should be removed from the search. This is done using the predicate:

$$\begin{aligned} Prune(c) \end{aligned}$$

that is true for clauses c that should be pruned from the search. These are used to remove cliques in which: (a) vertices contain probes that fall below a lower bound on the number of occurrences of the probe; (b) vertices contain probes that exceed an upper bound on the number of occurrences of the probe; and (c) there is no hydroxyl peak in plasmepsin II within 5 Å of an ASP dyad.
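The three pruning conditions can be sketched as a single test on a candidate clique. The fragment is illustrative and makes several assumptions: a clique is a list of (probe, coordinates) pairs, the ASP dyad is reduced to a single hypothetical `anchor` coordinate, and the test is applied to a complete candidate (how the lower bound interacts with incremental refinement is not modelled). The default bounds are the values given in Sect. 4.4.

```python
from collections import Counter
from math import dist


def prune(clique, anchor, min_occ=1, max_occ=2, anchor_dist=5.0,
          probes=("N", "O", "OH")):
    """True if the candidate clique should be removed from the search."""
    counts = Counter(probe for probe, _ in clique)
    # (a) and (b): every probe must occur between min_occ and max_occ times.
    if any(not (min_occ <= counts[p] <= max_occ) for p in probes):
        return True
    # (c): at least one hydroxyl vertex must lie within anchor_dist of the anchor.
    return not any(
        probe == "OH" and dist(xyz, anchor) <= anchor_dist for probe, xyz in clique
    )
```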

In addition to the problem-specific predicates above, it is possible to configure the ILP engine to enumerate cliques using a refinement operator. For a given clique size k, we use a refinement operator that incrementally constructs cliques of size k from frequent cliques of size \(k-1\) (see Sect. 2.1). On the first iteration, the refinement operator constructs cliques ab initio, since no frequent cliques have been determined.

Some of the conditions tested by the Prune predicate can also be prevented from arising at all, by defining the refinement operators so that they never generate such clauses. This would make the definition of these operators somewhat specific to MIF-graphs.

4.3 Algorithms and machines

The principal programs used in this paper were: GRID (Goodford 1985) for generating the molecular interaction fields for target-probe pairs; and the procedure in Table 1. The ILP engine used by the procedure is Aleph (Srinivasan 1999). We distinguish two variants of the MIF-clique finder: one that uses the domain-constraints described above, and a second that does not. The latter will act as a baseline, both for judging the utility of the background knowledge and as an application of a standard frequent clique-finder to the problem.

Generation of conformers for the molecules in the test set was done using the OpenBabel program (O’Boyle et al. 2011). Pharmacophore-based searching of the dataset is carried out using the Chemistry Development Kit (CDK: Steinbeck et al. 2003), where the pharmacophore features are first converted into SMARTS patterns (Desaphy et al. 2014). Chemicals for the decoy dataset were identified using the DecoyFinder program (Cereto-Massaque et al. 2012).

4.4 Methods

We adopt the following method to test the \(ilp\_mif\_cliques\) procedure in Table 1:

  1. With and without background knowledge:

    (a) For each malaria target T and each probe P, obtain the MIF values at pre-defined grid points.

    (b) Find the set of cliques for clique-sizes at most \(k_{max}\) using \(ilp\_mif\_cliques\).

    (c) Convert the cliques found into pharmacophores.

    (d) Obtain the performance of the pharmacophores quantitatively, using the test datasets D1 (positives and negatives) and D2 (positives and decoys).

  2. Compare the quantitative performance of the pharmacophores with and without background knowledge.

For simplicity, we will call the frequent-clique finder with domain constraints the “Domain-Specific” variant, and the frequent-clique finder without background knowledge the “Baseline” variant. The comparison in the last step serves the following purposes. First, we are able to obtain an assessment of the utility of the background constraints. Secondly, we can view this as a comparison against a procedure used in the literature for ligand-based frequent clique finding (Podolyan and Karypis 2009), adapted to the problem of finding frequent cliques in MIF-graphs. The Baseline variant also extends the results that would be obtained by the commercial program FLAP (Cross et al. 2010), which finds up to 4-vertex cliques (not necessarily frequent) using target structure.

The following details are relevant:

  • We use the following probe- and target-specific constraints in experiments: (a) Minimum distance; (b) Maximum distance; (c) Minimum and maximum occurrences; (d) Energy thresholds; and (e) Distance to anchor residue. The values used here are: Maximum distance: 10 Å; Minimum distance: 3 Å; Minimum occurrences: 1 (all probes), Maximum occurrences: 2 (all probes); Energy thresholds: \(-\)3.5 (N), \(-\)2.2 (O), \(-\)3.0 (OH); and Anchor distance: 5 Å. For more details on some of these constraints, see “Appendix 2”. The following values were used for generic constraints: Distance tolerance: 0.70 Å; Max. clique size: 8; Max. number of cliques: 100,000 (for each clique size). In addition, we impose a time-limit of 2 days for finding frequent cliques of any given size. This is to keep runtimes within reasonable limits.

  • Language constraints provided to the ILP engine are in the form of “mode” declarations that specify argument types and input–output roles of arguments for the two main predicates described above, namely \(Has\_Probe\_Peak\) and Dist. We refer the reader to Muggleton (1995) for a description of mode declarations.

  • Quantitative assessments will be based on the \(2\times 2\) confusion matrix resulting from the use of pharmacophores to identify aspartic protease inhibitors. We will examine performance on 2 datasets: (D1) the set of positives and negatives; and (D2) the set of positives and decoys. As usual, counts will be obtained for: positives that are predicted as inhibitors (true positives, or TP); positives that are not predicted as inhibitors (false negatives, or FN); negatives (or decoys) that are predicted as inhibitors (false positives, or FP); and negatives (or decoys) that are predicted as non-inhibitors (true negatives, or TN).Footnote 10 For convenience, we write \(P= TP + FN \) and \(N= TN + FP \). The usual measure that balances correct prediction and errors is the \(F_{\beta }\)-score:

    $$\begin{aligned} F_{\beta }=\frac{(1+\beta ^{2})\cdot TP }{(1+\beta ^{2})\cdot TP+\beta ^{2}\cdot FN + FP } \end{aligned}$$

    We are primarily interested in two kinds of pharmacophores: those with high precision and those with high recall. The former are useful for obtaining structure-activity information from the hits, and the latter are useful for screening potential ligands. The former will have high F-values for \(\beta <1\) and the latter will have high F-values for \(\beta >1\). We will use \(\beta =0.5\) and \(\beta =2\) to assess the pharmacophores tested. On dataset D2, it is of interest in the area of drug-design to calculate the Enrichment Factor (\( EF \)), which measures the performance of the pharmacophores as a model for targeting inhibitors, relative to a model that targets at random. This is defined as:

    $$\begin{aligned} EF =\frac{ TP /( TP + FP )}{P/(P+N)} \end{aligned}$$

    Readers will recognise this as the lift, or gain in precision, obtained by using the pharmacophores on dataset D2. An \( EF \) value close to 1 indicates no gain in performance over a random predictor.

  • We expect pharmacophores obtained from frequent cliques to be highly specific (that is, have high precision and perhaps low recall). Using the terminology introduced earlier, we will call these maximally-specific pharmacophores. For more general pharmacophores, we also consider quasi-cliques at Level 1. At this level, we restrict ourselves to quasi-cliques obtained by dropping an edge that may be chemically redundant. For example, if the frequent clique contains an edge between a pair of amides and an edge between a pair of hydroxyls, then one of these edges may be chemically redundant (both amides and hydroxyls are hydrogen-bond donors, and there may be no need to restrict the location of both). We will call the resulting pharmacophores slightly-general pharmacophores. Generalised pharmacophores are assessed in the same way as the maximally-specific pharmacophores.
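The two measures defined in the list above follow directly from the confusion-matrix counts, as the short sketch below shows. The function names are hypothetical, the zero-denominator guards are an added convention, and the counts in the usage note are made up for illustration and are not results from this paper.

```python
def f_beta(tp, fp, fn, beta):
    """F_beta = (1 + b^2) TP / ((1 + b^2) TP + b^2 FN + FP), with b = beta."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def enrichment_factor(tp, fp, fn, tn):
    """EF = precision / (P / (P + N)), where P = TP + FN and N = TN + FP."""
    p, n = tp + fn, tn + fp
    if tp + fp == 0 or p == 0:
        return 0.0
    return (tp / (tp + fp)) / (p / (p + n))
```

For instance, with the invented counts TP = 50, FP = 10, FN = 50 and TN = 890, precision (0.83) exceeds recall (0.50), so the score for \(\beta =0.5\) is higher than that for \(\beta =2\), and the enrichment factor is the precision divided by the base rate of 0.1.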

4.5 Results

Summaries of the performance of pharmacophores from the Domain-Specific and Baseline variants on dataset D1 (positives and negatives) are in Table 2. Summaries of the graphs on the training data, and the time taken to find frequent cliques, are in Table 3. Recall that each variant actually identifies k-vertex frequent cliques, which are then converted into k-feature maximally-specific pharmacophores. Slightly general forms of these are obtained by examining the quasi-cliques arising from dropping any one constraint that may be chemically redundant. For example, the largest frequent cliques found by the Domain-Specific variant have 5 vertices. These translate to 5-feature pharmacophores. “Slightly-general” pharmacophores are obtained by dropping 1 of the 10 edges in the 5-vertex cliques found. The Domain-Specific variant has predominantly one type of clique (\(N_2{-}O_1{-} OH _2\): see Table 4), and the tabulations show the performance of that type. Two kinds of slightly general pharmacophores result from removing either the \(N{-}N\) edge or the \( OH {-} OH \) edge: both N and \( OH \) are hydrogen bond donors, so one constraint in each such pair is deemed possibly redundant on chemical grounds.

At first glance, it would appear erroneous that the Baseline graph in Table 3(a) does not contain all pairwise distances (that is, \((546\times 545)/2\approx 150{,}000\) edges). This is because a target’s interaction points with distinct probes may not all be distinct (that is, the 3-d positions of vertices are not always distinct). This means some pairwise distances will be 0; and others may be duplicated. Zero-distance edges are eliminated, being chemically meaningless; and duplicate edges are represented only once.

Table 2 Performance of pharmacophores on the test-dataset D1 (positives and negatives)
Table 3 Empirical summaries of frequent-clique finding on the training data
Table 4 Pharmacophores obtained from frequent cliques found by the domain-specific variant

For the Domain-Specific variant, as expected, the maximally-specific cliques resulted in the highest precision and lowest recall. Some of the low recall may also result from the limited active chemical space spanned by the chemicals used in the positive examples. This chemical space involves inhibitors for aspartic proteases like plasmepsin, pepsin and cathepsin D, and is far smaller than the chemical space for the negative examples (containing inhibitors for non-proteases and other proteases). Eliminating an edge allows us to take into account the flexibility induced when a protein binds to a ligand. Chemical considerations suggest that edges between two donor sites of the same kind (N or \( OH \)) are likely candidates for such an elimination. This is confirmed by the results: both eliminations increase the number of hits (at the cost of precision, of course). These results suggest that these slightly general pharmacophores may be useful for virtual screening (especially the one resulting from dropping the \( OH {-} OH \) edge, which greatly increases recall, without too great a loss in precision).

The Baseline variant clearly results in pharmacophores that have no value either for structure-activity prediction or for screening. Even the smallest MIF-graph that is considered by the Baseline is substantially larger than the Domain-Specific one, and this is reflected in significant increases in the time required to find frequent cliques. Recall that the Baseline is, in effect, a resource-bounded version of the frequent clique-finder used in Podolyan and Karypis (2009) with ligands. These results appear to suggest that a direct adaptation of a ligand-based frequent clique finder will not work well, probably because MIF-graphs are substantially larger than ligand-graphs. The caveat is that the frequent clique-finder used is a resource-bounded one (for any size of clique, no more than \(n_{max} = 100{,}000\) cliques and no more than 2 days of search time). It is possible therefore that, given more resources, a better performance may result. Nevertheless, other things being equal, it appears that the Domain-Specific variant yields substantially better pharmacophores, in substantially less time, and we will henceforth not consider the Baseline variant. In summary, the results in Tables 2 and 3 provide evidence for the following:

  • The performance of the Domain-Specific variant is substantially better than the Baseline, and models are found substantially faster with domain- and generic-constraints than without; and

  • Good precision, recall, and F-values are obtained with the Domain-Specific variant on test data consisting of positives (compounds that only target aspartic proteases) and negatives (compounds that do not target aspartic proteases).

Table 5 shows the Enrichment Factors obtained with the Domain-Specific variant on dataset D2 (positives and decoys). It is evident that good enrichment factors are obtained with the Domain-Specific variant. The tabulations suggest that the precision obtained may be about 10–12 times the expected precision of a random-choice targeting model.

Table 5 Performance of the Domain-Specific variant on the set of positives and decoys

The reader may be concerned that the large number of frequent cliques (here about 2000 for the Domain-Specific variant) will result in diminished comprehensibility. This is correct in the first instance, but we are able to alleviate this concern somewhat by clustering the maximally-specific cliques (see Kaalia et al. 2015). The clustering suggests that there are probably only about 10 different clusters of cliques of type \(N_2{-}O_1{-} OH _2\) found by the ILP engine, with most of the cliques contributing to the hits falling in the same cluster. This suggests that there may be a small number of truly different and active pharmacophores that need to be considered. The analysis in Kaalia et al. (2015) makes a case for why the main type of clique (\(N_2{-}O_1{-} OH _2\)) makes sense chemically, and we refer the reader to that paper for the chemical comprehensibility of the results.

When assessing the body of results presented thus far, an entirely reasonable question to ask is this: how well can we discriminate between positives and negatives, based simply on molecular similarity? Figure 4 shows the intra- and inter-subset similarity of molecules in the positive and negative examples in dataset D1. Similarity is calculated using the pairwise Tanimoto coefficient from Daylight fingerprints of the compounds.Footnote 11 Figure 4 shows that there is a significant overlap between the bulk of the positive and the negative examples (a subset of positives can be identified with precision with Tanimoto coefficients above 0.4 or so, but recall will suffer). This is also evident from the scatterplot in Fig. 5, generated by multi-dimensional scaling of the Tanimoto similarities: the small set of separable positives is in the upper left-hand corner.

Fig. 4 Comparison of similarity (Tanimoto coefficient) between the sets of positive and negative examples

Fig. 5 A visualisation of the positive and negative examples that maps the Tanimoto similarities onto a two-dimensional space using multi-dimensional scaling

4.5.1 Note on the precision-recall tradeoff

There are no defined standards for precision and recall values of pharmacophore-based screening of ligand databases. For example, a study on ligand-based pharmacophore models for the discovery of 17\(\beta \)-Hydroxysteroid Dehydrogenase 2 inhibitors (Vuorinen et al. 2014) reported precision rates of 0.24–0.50, and recall rates of 0.40–0.50. Screening by receptor-based pharmacophore models designed for HIV-1 protease inhibitors (Fisher and Güner 2002) reported precision of about 0.10–0.4; and recall values of 0.007–0.05. This suggests that the values obtained here are quite respectable.Footnote 12

Generalising pharmacophores in the manner we have described (dropping distance constraints based on chemical principles) is one way of improving recall at the cost of precision. There are other ways of achieving this, although results may be less predictable. We have investigated elsewhere (Kaalia et al. 2015) the effect of changing the maximum distance allowed between pharmacophore features from 10 Å, as used here, to 20 Å. In the first instance, it may be thought that relaxing this distance constraint should allow more general pharmacophores (and hence increase recall). In fact, the results in Kaalia et al. (2015) do not bear this out, probably because the target’s structure precludes overly-large pharmacophores. It may be more productive to conduct instead a heuristic search through the subgraph lattice described in Sect. 3.4.1, guided for example, by F-values, or by the activities of the false-negatives. False negatives may be acceptable if they are restricted to poor inhibitors: the generalisation technique in this paper is agnostic to activities of true-positives and false-negatives (see Fig. 6).

Fig. 6 Frequency distribution of true-positives and false-negatives. Across a wide range of bioactivity, there are roughly as many true-positives as false-negatives, suggesting that the models used do not show any preference for any particular bioactivity range

5 Related work

De novo drug-design is an active area of research, with techniques ranging from the purely biophysical to the purely computational. We focus first on related work that has used machine-learning in some manner. It has been our intention here to extend the ligand-based discovery of pharmacophore cliques in Finn et al. (1998) to target-based discovery of MIF cliques. Both the approach of Finn et al. and the paper here use ILP as the basic engine for clique discovery. In the former, background knowledge contains the location of hydrogen bond donors, acceptors, and the location of zinc-sites on ligands. No probe- or target-specific constraints of the kind described here are employed to restrict the search space. Independently of the work in Finn et al. (1998), Podolyan and Karypis (2009) also approach the problem of pharmacophore detection as clique identification. They do not use a general-purpose machine learning program, but a special-purpose program that incrementally constructs larger cliques from smaller, frequent cliques. The Baseline variant that we have constructed here is a direct adaptation of this ligand-based method to the problem of frequent clique-finding in MIF graphs.

There are several other methods that concentrate purely on ligand-based pharmacophore modeling such as HipHop (Barnum et al. 1996), HypoGen (Li et al. 2000), DISCO (Martin 2000), Catalyst (Hecker et al. 2002) and PHASE (Dixon 2006). These cannot be re-used for receptor-based models, whereas we have been able to use essentially the same ILP program used for ligand-based models in Finn et al. (1998) to construct receptor-based models here. Receptor-based pharmacophore generation methods do exist, such as LigandScout (Langer and Wolber 2005; Lai and Chen 2006; Schuller et al. 2006). These require either protein structure and at least one known ligand for identifying pharmacophoric features, or domain knowledge in the form of residue-based interactions. None of these use MIFs to identify pharmacophoric features, or the kind of domain-knowledge we have used here.

Perhaps the most directly relevant computational approach to this paper is FLAP (“Fingerprints for ligands and proteins”: Cross et al. 2010). As in this paper, FLAP uses MIF values at grid-points calculated by the GRID program. FLAP refers to proprietary software that contains, amongst other useful procedures for assisting receptor-based drug design, programs that find 3- or 4-vertex cliques from the MIF surface obtained from some part of the active site of a single target and a probe. There are some difficulties with using FLAP for the problem described in this paper: cliques found here are of size 5; we use MIFs from all of the active site; we can use MIFs from multiple targets and probes at once; and we have ensured that the result conforms to various domain-specific intuitions. We are not aware of any immediate way to extend FLAP to account for these differences. Nevertheless, there may not be any conceptual issues preventing the development of a generalised form of FLAP that has these facilities.

Turning to more general work in the area of machine learning, the research that is clearly relevant is that on frequent subgraph mining (FSM: see Jiang et al. 2013). It is clear that an FSM program can be adapted to the task of finding cliques in MIF-graphs. It is less apparent what modifications would be needed to incorporate biological and chemical constraints flexibly into subgraph mining.

6 Conclusions

This study explores an ILP-assisted approach for de novo drug design in which an ensemble of pharmacophores is designed for a class of target proteins, without any prior knowledge of ligands. The approach here is to first identify frequently occurring cliques in a graphical representation of energy surfaces. The surfaces are obtained from the energies of interaction of the proteins with some specific kinds of chemicals (or probes). Points of high energy in the surfaces are treated as vertices in a graph, with edges weighted by the inter-vertex distance. Cliques that occur in the energy-graphs of all (or many) proteins form the basis for pharmacophores that can be used to identify or describe potential ligands. The case for using ILP to find cliques rests primarily on its ability to incorporate domain-knowledge directly into the clique-detection process. We have shown here how different kinds of domain constraints like distances between energy peaks, probe-specific energy thresholds, and pharmacological constraints imposed by the target proteins can be encoded as background knowledge provided to an ILP engine that is configured to find frequent cliques in a set of graphs. The result is a general-purpose method that can be specialised to any class of proteins and any number of probes (what will change are some of the domain-constraints; the ILP engine and the algorithm it uses remain the same). To the best of our knowledge, there have been no machine-learning or graph-based methods that have been shown to perform this form of multi-target, multi-probe de novo drug design that takes such a variety of domain-knowledge into account. Of course, there is no necessary requirement for multiple probes, and the technique we have used here can work without modification with multiple targets and a single probe. The notion of a frequent clique with a single target becomes trivial: the approach here will find all cliques occurring in the target, but if this is to be manageable, then we would need more background knowledge than we have here.

We have presented an extensive case study concerned with identifying inhibitors for aspartic proteases, which is of special interest in the design of antimalarials. Again, this appears to be a novel contribution to the specific problem of antimalarial targeting. The pharmacophores obtained from the ILP-discovered cliques were tested on a database of known inhibitors and a set of decoys. The results show that: (a) The ILP-constructed patterns can be used to identify aspartic protease inhibitors with reasonably high precision and recall; (b) The role of domain-constraints is extremely important in being able to find such patterns; and (c) The models found are unlikely to be chance patterns, based on the low hit-rate on decoys.

In this era of Big Data, it may seem a period-piece to consider problems where the data come from the interactions of 6 proteins with 3 probes. There are two points worth emphasizing here. First, there are many problems where “small data” are the norm [see, for example, Hand et al. (1994) for an entire book of such problems]. This is quite often so in the life sciences, where generating a data instance (for example, the crystal structure of the target site of a protein) involves painstaking experimental work. For the antimalarial problem considered here, for example, we are unlikely to have many more crystal structures than are available at present. Secondly, the number of data instances can be quite misleading. Each data instance here is a complex object with a rich internal structure that entails a very large space of possible patterns. Under such circumstances, the role of domain knowledge in ruling out chance patterns (over-fitting) is substantial, as has been demonstrated here by the discovery of patterns that are able to discriminate both positives from negatives and positives from decoys. If the importance of domain knowledge for dealing with Small Data problems of this kind is accepted, then it follows naturally that we will need discovery engines that can use such knowledge with little or no re-programming. It is our contention that ILP engines, which explicitly accept background knowledge as input, are a natural choice in such cases.

There are a number of ways in which the work described here could be extended. The specific application to antimalarials can be extended usefully in two different ways. First, rather than focus on a general class of proteins like aspartic proteases, we can focus on just a specific target like plasmepsin II. Second, there is significant interest in selective pharmacophores that discriminate one set of proteins from another (plasmodial versus human, for example). Both these tasks are well within the capabilities of the kind of ILP engine used here, and provide opportunities to enrich the background knowledge further. Target-specific knowledge in the form of what is already known about plasmepsin II, for example, could constrain the search for cliques. Looking beyond the specific application presented here, it is naturally of interest to demonstrate the technique's applicability to other classes of proteins and to other diseases. We intend to compare this against the receptor-based work on HIV-1 protease inhibitors reported in Fisher and Gner (2002).

The current implementation is a collection of programs that communicate results through the straightforward mechanism of text files. The natural next step is to integrate these into a single platform, with suitable background knowledge libraries and user-interfaces to allow the approach to be used by domain specialists with little or no ILP knowledge. On the algorithmic front, worst-case results for clique-finding are not promising. Nevertheless, there may be better ways to find frequent cliques than the general-to-specific approach we have used here. This is especially the case if we want cliques that occur in all target graphs. For example, it is evident that using the least-general-generalisation (lgg) of most-specific cliques from each target, provided it is well-defined, would immediately yield a maximal clique. Difficulties may arise in computing this lgg, but there may be compact ways of representing multiple edges between a pair of vertices (using intervals of distances, rather than an edge for each distance, for example). The subgraph lattice induced by a frequent clique suggests an interesting direction to pursue when we have structures of both receptors and ligands. Once a frequent clique has been obtained from the MIF-graphs of receptors, we have relied here upon chemical intuition to generalise the maximally-specific pharmacophore. If the structures of ligands are also available, then we can search the subgraph lattice to find the pharmacophores that best identify the inhibitors. Again, if this search can be guided by domain-knowledge, then we would expect ILP engines to be a reasonable choice. An altogether different approach would be to convert the entire problem of specifying constraints and finding cliques into some form of continuous optimisation problem for which there are good methods of solution. This may yield a more efficient route than ILP to solving the problem of pattern finding in MIF-graphs for de novo drug design.
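The suggestion of representing edges by intervals of distances can be sketched as follows. This is only an illustration of the interval idea, not a computation of the lgg in the ILP sense: it assumes that a vertex correspondence between the cliques is already given (here, by a unique probe label per vertex, which will not hold in general), and the function names and toy distances are hypothetical.

```python
def interval_generalisation(cliques):
    """Generalise aligned cliques into one pattern with interval-valued edges.

    Each clique maps a sorted (label_a, label_b) pair to a distance. Only
    pairs present in every clique are kept; each is given the tightest
    interval (min, max) that contains all of the observed distances.
    """
    shared = set(cliques[0]).intersection(*cliques[1:])
    return {
        pair: (min(c[pair] for c in cliques), max(c[pair] for c in cliques))
        for pair in shared
    }


def covers(general, clique):
    """True if every interval edge of `general` is satisfied by `clique`."""
    return all(
        pair in clique and lo <= clique[pair] <= hi
        for pair, (lo, hi) in general.items()
    )


# Hypothetical most-specific cliques from two targets (distances in angstroms).
clique_a = {("acceptor", "donor"): 3.0, ("acceptor", "hydrophobic"): 5.0,
            ("donor", "hydrophobic"): 4.0}
clique_b = {("acceptor", "donor"): 3.4, ("acceptor", "hydrophobic"): 5.3,
            ("donor", "hydrophobic"): 3.8}
# A clique from a third, unrelated structure with a much longer first edge.
clique_c = {("acceptor", "donor"): 6.0, ("acceptor", "hydrophobic"): 5.1,
            ("donor", "hydrophobic"): 3.9}

general = interval_generalisation([clique_a, clique_b])
```

Each vertex pair carries a single interval rather than one edge per observed distance, so the generalised pattern stays the same size as the cliques it was built from. By construction it covers both input cliques, while `clique_c` falls outside the acceptor-donor interval.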