Findings

Identification of unknown protein functions is essential for understanding biological processes and beyond [1, 2]. Enzymes are proteins whose function is to catalyse chemical reactions in a living cell. Ascertaining enzymatic mechanisms can have important applications for pharmaceutical and industrial processes in which catalysts are involved [1]. For example, identifying the catalytic mechanism(s) of an enzyme could lead to designing new biocatalysts that give significant cost savings over non-biological alternatives in sectors such as laundry, deodorants, foods and agriculture [1].

Unlike predicting enzymatic functions at the level of the chemical reaction performed [2–4], the problem of predicting by which molecular mechanism a particular enzyme operates has not been well researched [1]. Two of us, De Ferrari and Mitchell, have recently looked into this question. In that work, we utilised a pattern recognition approach to predict chemical mechanisms from enzyme sequences [1]. To the best of our knowledge, that study was the first attempt to predict enzymatic mechanism in this way.

One notable aspect of that work was the excellent prediction success rate of over 96 % for 248 test enzymes (albeit in a leave-one-out setting), even though the training dataset was small and the simple k Nearest Neighbour rule with k = 1 (k1NN) [5, 6] was the algorithm employed for pattern classification. The k1NN rule is well known to be highly sensitive to errors in the training set [7], in particular when the training dataset is small [7–9]. For example, the number of training examples required for a k1NN rule to achieve high classification or prediction accuracy grows exponentially with the number of irrelevant features (noise) [7, 9].

In the light of the “anomaly” described above, we have re-analysed that mechanism dataset and our previous classification results—mainly to understand and explain, if possible, the high prediction success rate achieved.

In the following section, we briefly describe our previous work. The “Results” section presents our new findings, and the final section gives our concluding remarks.

To our knowledge, our study was the first attempt at bulk prediction of enzymatic mechanism from protein sequence [1]. The predictive model was an empirical and observational model [10] based on the concept of pattern classification.

Formally, a pattern classification problem deals with the optimal assignment of an object to one of J predefined classes, categories or labels, \(\Omega = \left\{ {\omega_{1} ,\omega_{2} , \ldots ,\omega_{J} } \right\}\), whereby it is assumed that the object is adequately characterized by L features, \(x_{i}\) with i = 1, 2, …, L. Typically, the object is represented by an L-dimensional vector x, whose elements \((x_{1}, x_{2}, \ldots, x_{L})\) are discriminatory features that ideally can identify the object with a low misclassification error rate. In this regard, the classification task is equivalent to establishing a mapping

$$f:\chi \to \Omega \tag{1}$$

from the feature space χ into the class space Ω, such that \(x \in \chi\) is assigned to its appropriate class label \(\omega_{j} \in \Omega\), where j = 1, 2, …, J. Each point in the class space has a corresponding region(s) or subspace(s) in the feature space defined by the L features.

In our previous study, the feature \(x_{i}\) denotes absence (0) or presence (1) of an InterPro signature for an enzyme sequence, i.e., \(x_{i} \in \{ 0,1\}\). In other words, χ was a binary feature space \(\chi = \left\{ {0,1} \right\}^{L}\). The class space Ω comprised J discrete points, each representing one of the enzyme mechanism labels \(\omega_{j}\), extracted from Version 3.0 of the MACiE (Mechanism, Annotation and Classification in Enzymes) database [11–13].
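The binary encoding described above can be illustrated with a minimal sketch. The signature identifiers (`SIG_A` and so on) and the helper name `encode_signatures` are hypothetical placeholders, not real InterPro accessions or part of the original pipeline.

```python
def encode_signatures(enzyme_signatures, vocabulary):
    """Map a collection of signature identifiers to a 0/1 vector over `vocabulary`.

    Position i is 1 if vocabulary[i] is present for the enzyme, else 0.
    """
    present = set(enzyme_signatures)
    return [1 if sig in present else 0 for sig in vocabulary]


# Hypothetical vocabulary with L = 4 features (placeholder names, not real data).
vocabulary = ["SIG_A", "SIG_B", "SIG_C", "SIG_D"]

x = encode_signatures(["SIG_B", "SIG_D"], vocabulary)
print(x)  # [0, 1, 0, 1]
```

In the dataset discussed here the vocabulary would hold the 321 InterPro signatures, so each enzyme would map to a point in \(\left\{ {0,1} \right\}^{321}\).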

The mapping algorithm was the simple k1NN classifier. This algorithm can essentially be viewed as a dictionary search [14]. That is to say, all the data points allotted for training are stored in memory (a dictionary in χ), and a test data point is assigned the class label or labels \(\omega_{j}\) of the closest point in the dictionary, i.e., in χ. The specific implementation used in our calculations was Mulan’s BRKNN algorithm [5, 15].

Generally speaking, the integration process carried out by InterPro’s curators removes many of the redundant signature matches that might otherwise occur. This results in a relatively small number of InterPro signatures being present for the typical sequence in this dataset. Thus, the squared nearest neighbour distance often takes small integer values, and it is common to find several nearest neighbours at an equal distance. In this case, the label (or label set) most common amongst the ring of nearest neighbours is assigned.
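The dictionary search and the tie rule can be sketched as follows. This is an illustrative toy under stated assumptions, not Mulan's BRKNN implementation: the function names, the placeholder labels `M_A` and `M_B`, and the handling of exact ties in the vote (first-seen wins) are our own choices. For 0/1 vectors the squared Euclidean distance equals the number of differing positions, which is why it takes small integer values.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance; for 0/1 vectors this is the Hamming distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def predict_1nn(query, dictionary):
    """Assign the label set most common among all equidistant nearest neighbours.

    `dictionary` is a list of (feature_vector, label_set) pairs, where each
    label_set is a frozenset so that it can be counted.
    """
    distances = [squared_distance(query, vec) for vec, _ in dictionary]
    d_min = min(distances)
    # The "ring": every stored point lying exactly at the minimum distance.
    ring = [labels for (_, labels), d in zip(dictionary, distances) if d == d_min]
    # Assumption: an exact tie in the vote is resolved in favour of the
    # label set encountered first.
    return Counter(ring).most_common(1)[0][0]


# Toy dictionary with placeholder mechanism labels.
dictionary = [
    ([1, 1, 0, 0], frozenset({"M_A"})),
    ([1, 0, 1, 0], frozenset({"M_A"})),
    ([0, 0, 1, 1], frozenset({"M_B"})),
]

prediction = predict_1nn([1, 0, 0, 0], dictionary)
print(sorted(prediction))  # ['M_A']
```

Here the query lies at squared distance 1 from the first two stored points and 3 from the third, so the ring contains two neighbours, both labelled `M_A`.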

The mechanism dataset consists of 248 enzymes annotated against 71 MACiE labels, where each enzyme is represented by 321 InterPro signatures; that is, L = 321 and J = 71. We employed a leave-one-out validation scheme: 247 of the enzymes whose mechanisms were known were utilised as a “dictionary” and the mechanism(s) of the one remaining enzyme was predicted, this process being repeated 248 times. The simple pattern recognition approach yielded an excellent prediction success rate of over 96 % for the 248 test enzymes.
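The leave-one-out scheme can be sketched on toy data as below. The data, the single-label simplification and the function names are assumptions made for illustration; the figures produced bear no relation to the 96 % reported for the real dataset.

```python
from collections import Counter


def predict_1nn(query, dictionary):
    """1NN on 0/1 vectors with a majority vote among equidistant neighbours."""
    dists = [sum(q != v for q, v in zip(query, vec)) for vec, _ in dictionary]
    d_min = min(dists)
    ring = [label for (_, label), d in zip(dictionary, dists) if d == d_min]
    return Counter(ring).most_common(1)[0][0]


def leave_one_out_accuracy(data):
    """Hold out each example in turn and predict it from all the others."""
    hits = 0
    for i, (vec, label) in enumerate(data):
        dictionary = data[:i] + data[i + 1:]  # the remaining n - 1 examples
        hits += predict_1nn(vec, dictionary) == label
    return hits / len(data)


# Toy data with placeholder labels; one label per example for simplicity.
toy = [
    ([1, 1, 0, 0, 0], "M_A"),
    ([1, 1, 0, 0, 0], "M_A"),
    ([0, 0, 1, 1, 0], "M_B"),
    ([0, 0, 1, 0, 0], "M_B"),
    ([0, 0, 0, 0, 1], "M_C"),  # only example of its class
]

print(leave_one_out_accuracy(toy))  # 0.8
```

In this toy set the single `M_C` example is the only miss: once it is held out, no example of its class remains in the dictionary.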

Methods

In the present work, we are not directly concerned with the question of defining enzyme mechanisms; instead, we just use the mechanism dataset. We focus on finding the reasons why the k1NN rule gave us such good classification results for this small dataset, its size being limited by the considerable experimental effort required to characterise enzyme mechanisms.

While directly visualising the 321-dimensional feature space \(\chi = \left\{ {0,1} \right\}^{L = 321}\) would be impossible, we were able to go through the dataset manually. The mechanism dataset was represented by a 248-by-323 matrix whose rows were the 248 enzymes. The first and last columns contained the enzyme names (the UniProt accession numbers of the enzyme sequences) and their associated mechanism class labels, respectively. The remaining 321 columns denoted the 321 InterPro signature features.

We systematically swapped the 321 columns containing the InterPro signature features while keeping the rows and the first and last columns of the matrix fixed.
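One way to automate this kind of column reordering is sketched below. It is a simple heuristic of our own choosing, not the manual procedure used in this work. It assumes that the rows are already grouped by mechanism label, and it sorts the feature columns by the first row in which each feature is present.

```python
def block_order_columns(matrix):
    """Return column indices sorted by the first row where each feature is 1.

    Assumes rows are already grouped by class; all-zero columns are placed last.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])

    def first_row(col):
        for r in range(n_rows):
            if matrix[r][col]:
                return r
        return n_rows  # feature never present

    # sorted() is stable, so columns sharing a first row keep their order.
    return sorted(range(n_cols), key=first_row)


def permute_columns(matrix, order):
    """Apply a column permutation while leaving the rows untouched."""
    return [[row[c] for c in order] for row in matrix]


# Toy 3-by-4 binary matrix with scrambled feature columns.
scrambled = [
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]

order = block_order_columns(scrambled)
print(order)  # [1, 3, 0, 2]
for row in permute_columns(scrambled, order):
    print(row)
```

After the permutation the toy matrix has one block of ones in the top-left corner and another in the bottom-right corner, with zeros elsewhere.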

Results

After a number of iterations, we ended up with a block diagonal version of the original data matrix (see Fig. 1). The figure, a heat map of the data matrix, seems to explain why k1NN yielded the excellent classification results [1]. In the figure, the abscissa denotes InterPro signatures, whereas the vertical axis represents the enzyme sequence’s UniProt accession number and the corresponding MACiE enzymatic mechanism labels of the form M0123. The colour yellow signifies that feature \(x_{i}\) (InterPro signature) is present for the enzyme, while the red colour indicates that feature \(x_{i}\) is absent for the enzyme.

Fig. 1 Heatmap of our data matrix. The horizontal axis denotes InterPro signatures, whereas the vertical axis represents the enzyme sequence’s UniProt accession number and the corresponding MACiE enzymatic mechanism labels of the form M0123. The yellow colour signifies that feature \(x_{i}\) (InterPro signature) is present for the enzyme, while the red colour indicates that \(x_{i}\) is absent for this enzyme. Enzymes that possess the same MACiE mechanism label reside in a subspace of the feature space χ, which barely overlaps with other subspaces associated with other mechanisms. The inset depicts the heatmap for the dataset matrix corresponding to the InterPro signatures and names of enzymes with the MACiE enzymatic mechanism label M0218.

According to Fig. 1, the 321 InterPro signatures are highly discriminating features. Enzymes that possess the same enzymatic mechanism \(\omega_{j}\) reside in a subspace (region) in \(\chi = \left\{ {0,1} \right\}^{L = 321}\) which barely overlaps with neighbouring regions. The inset in Fig. 1 depicts the heatmap of the portion of the dataset that corresponds to the enzymes (and their InterPro signature features) that have MACiE enzymatic mechanism label M0218, i.e. \(\omega_{j} = M0218\). Note that a subspace for a given mechanism can be a composite (union) of non-overlapping “sub-subspaces”. The sharing of the M0218 label by two separate non-homologous sequences illustrates the presence of two distinct proteins, firstly pancreatic lipase and secondly colipase, in the reactive complex.

Out of our 71 regions, only the two regions representing enzymes with MACiE mechanisms \(\omega_{j = 30} = M0348\) and \(\omega_{j = 35} = M0269\) completely overlap. The same four InterPro signature features represent the enzymes that show mechanisms M0348 and M0269, highlighted in red in Table 1.
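A check for overlapping regions of this kind can be sketched as follows. We take the "region" of a mechanism to be the set of features present in at least one of its enzymes; this definition, the function names and the toy labels are assumptions made for illustration.

```python
from itertools import combinations


def class_regions(data):
    """Map each label to the set of feature indices present in its examples."""
    regions = {}
    for vec, label in data:
        regions.setdefault(label, set()).update(
            i for i, v in enumerate(vec) if v
        )
    return regions


def overlapping_pairs(regions):
    """List label pairs that share features.

    Each entry is (label_a, label_b, shared_feature_indices, is_complete),
    where is_complete is True when the two regions are identical.
    """
    pairs = []
    for a, b in combinations(sorted(regions), 2):
        shared = regions[a] & regions[b]
        if shared:
            pairs.append((a, b, sorted(shared), regions[a] == regions[b]))
    return pairs


# Toy data with placeholder labels: M_B and M_C use exactly the same features.
toy = [
    ([1, 1, 0, 0], "M_A"),
    ([1, 0, 0, 0], "M_A"),
    ([0, 0, 1, 1], "M_B"),
    ([0, 0, 1, 1], "M_C"),
]

print(overlapping_pairs(class_regions(toy)))
# [('M_B', 'M_C', [2, 3], True)]
```

Two mechanisms whose regions coincide completely, as `M_B` and `M_C` do here, cannot be told apart from these features alone.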

Table 1 Enzymatic MACiE mechanism labels \(\omega_{j}\) and the number of enzymes reported to possess this mechanism \(\omega_{j}\)

We suggest that our block data-matrix could be employed as an enzymatic mechanism prediction tool—a template against which to match novel enzymes to ascertain their potential enzymatic mechanisms in regard to the 71 mechanisms in the mechanism dataset.
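One possible form of such a template lookup is sketched below, purely as an illustration. The Jaccard score, the function names and the placeholder identifiers are our own assumptions; no such tool is specified in this work.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_template(query_signatures, templates):
    """Return the labels whose signature block best matches the query.

    `templates` maps each mechanism label to its set of signature identifiers.
    Several labels are returned when they tie, and an empty list is returned
    when the query shares no signature with any block.
    """
    query = set(query_signatures)
    scored = [(jaccard(query, sigs), label) for label, sigs in templates.items()]
    best = max(score for score, _ in scored)
    if best == 0.0:
        return []
    return sorted(label for score, label in scored if score == best)


# Hypothetical template with placeholder labels and signature names.
templates = {
    "M_A": {"SIG_A", "SIG_B"},
    "M_B": {"SIG_C"},
    "M_C": {"SIG_C"},
}

print(match_template(["SIG_A"], templates))  # ['M_A']
print(match_template(["SIG_C"], templates))  # ['M_B', 'M_C']
print(match_template(["SIG_Z"], templates))  # []
```

Returning every tied label rather than a single one makes an ambiguity visible, as would arise for two mechanisms represented by the same signatures.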

In this work, our mechanism dataset was re-analysed to ascertain why a simple but high-variance classifier yielded such excellent classification results.

We hope that we have provided a reasonable explanation: the mechanism dataset matrix is block diagonal in the feature and class spaces. In other words, the features (almost) uniquely codify the chemical mechanism of a given enzyme.

Based on these observations, we have also made the suggestion that one might be able to utilise the dataset matrix as an enzymatic mechanism prediction tool.