1 Introduction and Motivation

Identifying protein functions and analyzing their interactions helps to understand the mechanisms that govern living organisms and, accordingly, to establish new and effective therapeutic strategies. In most cases, the functions of a protein can be predicted by analyzing its structure, which is characterized by the composition of its molecules (e.g., amino acids) as well as their relationships and spatial positions [4].

To this end, proteins are first separated from the other cellular components (e.g., by ultracentrifugation or electrophoresis). Their structures can then be studied by various methods such as X-ray crystallography, Nuclear Magnetic Resonance or mass spectrometry. Biologists and biochemists around the world regularly use these analysis methods and submit the data they obtain (e.g., 3D structural information of biological macromolecules) to a shared public database named the Protein Data Bank (PDB) [2].

Several bioinformatics research topics related to protein analysis have been investigated in the literature and are presented hereafter.

Due to the increasing interest in protein analysis and to the development of emerging instruments and technologies, the size and diversity of digitized protein information keep growing, which makes such a database complex to exploit. In [5], a freely available web-based database exploration tool (the PDB-Explorer website) is proposed. It allows users to interactively visualize and explore the structural diversity of the PDB (e.g., through color-coded map generation or structure classification).

In [14], the author tackles the problem of functional annotation from protein 3D structures, for which most solutions use 3D structure superposition techniques that are computationally demanding. The author combines geometric characteristics and physicochemical features to analyze protein surfaces efficiently.

In [7], the authors study the problem of understanding protein-protein interactions. They propose a methodology for predicting hot spots in protein-protein interfaces. The presented model is trained on a large number of structural and evolutionary sequence-based features, and several classification algorithms with cost functions are used. The best model is selected by using c-forest, a random forest ensemble learning method.

In this paper, our goal is to present a transfer learning-based methodology for indexing protein structures represented by 3D point clouds. Indeed, training a neural network can be computationally expensive. Additionally, it requires the preparation of ground truths, which is a tedious task (manual data labelling). Hence, instead of training a neural network, a network pre-trained on generic 3D objects is directly exploited to characterize protein structures. Our proposed indexing methodology is of interest to biologists who are looking for automated solutions to find family members of a query protein, or even to label new structures, by directly using raw 3D point clouds as input.

Fig. 1. Overview of our proposed transfer learning-based method.

2 Proposed Methodology

Transfer learning consists of exploiting the knowledge gained while solving one problem and applying it to a different but related problem. Nevertheless, efficient transfer learning needs surrounding processing stages that adapt it to the targeted problem and its application context. In this section, we describe the proposed methodology, entitled “Generic Learning-based Transfer for Indexing Proteins (GLT4IP)”. It focuses on a transfer learning-based indexing method for 3D protein shape retrieval.

Figure 1 provides an overview of the major stages. First, the input protein, represented as a 3D point cloud, is resampled and normalized. The resulting pre-processed protein data is fed into a Convolutional Neural Network (CNN) through a classification architecture that was pre-trained on a 3D object database. Since this database is composed of a large variety of man-made objects, the parameters of the exploited CNN architecture (e.g., associated layers, weight coefficients) are tuned for classifying a wide range of object shapes. Transfer learning is then applied by extracting from this CNN architecture, for each protein, a feature vector that globally embeds the structural information of the protein in a generic manner. Finally, the extracted feature vectors are used to compute pairwise similarity scores between proteins. Sorting these scores then makes it possible to identify the proteins whose structural characteristics are similar to those of a query protein, which amounts to protein shape indexing.

2.1 Sub-sampling of the Considered Protein Point Clouds

Before proceeding to the feature extraction, and in order to be able to exploit the considered CNN architecture, the 3D point cloud representing the protein surface (several thousand points) is sub-sampled to reduce its size to 2048 3D points while keeping its global structure. This sub-sampling stage adjusts the protein data size to the input size handled by the CNN architecture. To this end, we apply a volumetric clustering algorithm to the original protein by exploiting a simplification method proposed in [1]. In particular, the minimum bounding box of the object is subdivided into a 3D voxel grid according to a leaf size parameter (voxel size). This parameter is set according to the targeted size of the final point cloud (2048 3D points). The resulting point cloud is then generated by calculating, for each voxel that contains points, the centroid of those points. The main advantage of such a transformation is its ability to preserve the global structure of the object thanks to a uniform sampling of the original surface. Additionally, it is known to be computationally fast thanks to the use of advanced data structures (see the octree of the Point Cloud Library [11]).
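As an illustration, the voxel-grid simplification described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the implementation of [1] or of the Point Cloud Library: the function names `voxel_downsample` and `subsample_to_target`, the bisection on the leaf size, and the random removal of surplus centroids to reach exactly 2048 points are all hypothetical choices made for the example.

```python
import numpy as np

def voxel_downsample(points, leaf_size):
    """Replace the points of each occupied voxel by their centroid."""
    points = np.asarray(points, dtype=np.float64)
    origin = points.min(axis=0)                      # corner of the bounding box
    voxel_idx = np.floor((points - origin) / leaf_size).astype(np.int64)
    _, inverse, counts = np.unique(
        voxel_idx, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)                    # flat index, whatever the NumPy version
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inverse, points)                 # accumulate coordinates per voxel
    return sums / counts[:, None]

def subsample_to_target(points, target=2048, iters=40, seed=0):
    """Assumed strategy: bisect the leaf size, then trim to exactly `target` points."""
    points = np.asarray(points, dtype=np.float64)
    if len(points) < target:
        raise ValueError("the cloud has fewer points than the target size")
    extent = (points.max(axis=0) - points.min(axis=0)).max()
    lo, hi = extent * 1e-6, extent                   # tiny voxels vs. one huge voxel
    best = points
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        reduced = voxel_downsample(points, mid)
        if len(reduced) >= target:
            best, lo = reduced, mid                  # enough centroids: try larger voxels
        else:
            hi = mid                                 # too few centroids: shrink voxels
    # Assumption: surplus centroids are dropped at random to hit the exact size.
    rng = np.random.default_rng(seed)
    keep = rng.choice(len(best), size=target, replace=False)
    return best[keep]
```

The number of occupied voxels is not strictly monotone in the leaf size, so the bisection only approaches the target. This is why a final trimming step is assumed here.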

2.2 Normalization of the Sub-sampled Protein Point Clouds

Once the sub-sampled point clouds are obtained, the next stage normalizes them in order to make the targeted protein-to-protein comparison coherent. The normalization is twofold: (i) the sub-sampled 3D point clouds are spatially rescaled: each object is normalized into a unit sphere corresponding to its minimal bounding sphere, using an algorithm that has the advantage of not being time consuming ([13] and [9]); (ii) each rescaled 3D point cloud is then re-centered by computing its barycenter and applying a zero-mean translation to its points (i.e., registration of the 3D points to the origin of a common XYZ reference frame). It is worth mentioning that normalization does not change the number of points in each protein point cloud, which is still equal to 2048.
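The two normalization steps can be sketched as follows. This is a minimal sketch, not the procedure of [13] and [9]: the minimal bounding sphere is replaced, as a simplifying assumption, by the sphere centered on the barycenter whose radius is the largest point-to-barycenter distance, and the function name `normalize_cloud` is hypothetical.

```python
import numpy as np

def normalize_cloud(points):
    """Rescale a cloud to unit radius, then translate it to zero mean."""
    points = np.asarray(points, dtype=np.float64)
    barycenter = points.mean(axis=0)
    # Assumption: approximate bounding sphere centered on the barycenter,
    # standing in for a true minimal bounding sphere algorithm.
    radius = np.linalg.norm(points - barycenter, axis=1).max()
    scaled = points / radius                 # (i) spatial rescaling
    return scaled - scaled.mean(axis=0)      # (ii) zero-mean translation
```

With this approximation, every normalized point lies within the unit sphere centered at the origin, and the number of points is unchanged.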

2.3 Extraction of Structural Feature Vectors

Each prepared protein 3D point cloud (natural 3D object) is then fed into a CNN architecture that was pretrained on a large database of diverse man-made 3D objects, in order to benefit from a deep analyzer already calibrated for structural classification objectives (transfer learning). Indeed, deep learning architectures have in recent years pushed the frontier of performance in many computer vision and 3D applications, including data detection, segmentation and classification. Our methodology exploits the PointNet classification architecture [8] as a generic feature vector extractor.

More precisely, we do not consider the output of the last layer of this architecture (i.e., the classification vector). Instead, we use the pretrained network to extract a global descriptor vector from an intermediate fully connected layer. To choose this layer, we conducted an empirical study to identify which layer level gives the highest performance (see the architecture layers in Fig. 2 of [8]). Consequently, the feature vectors generated for the prepared proteins implicitly take advantage of information learned on a dataset of approximately 12,300 CAD 3D objects belonging to 40 possible categories (details of the operations and training protocols are presented in the PointNet reference).

2.4 Shape Matching

Having generated a descriptor vector for each protein, the last stage consists of measuring the protein-to-protein similarity. To this end, we experimented with two cost functions over the descriptor vectors, namely the Euclidean distance and the Earth Mover's distance [10]. Both functions provide a dissimilarity score between two compared proteins, and a value of 0 means that their descriptors are identical. Proteins are then sorted from the closest to the furthest with respect to each query protein (e.g., for generating the distance matrix needed for object indexing).
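For the Euclidean case, the matching stage can be sketched as below. This is a minimal sketch assuming the descriptors are stacked row-wise in a NumPy array. The function names `euclidean_distance_matrix` and `rank_database` are hypothetical, and the Earth Mover's distance variant is not covered.

```python
import numpy as np

def euclidean_distance_matrix(features):
    """Pairwise Euclidean distances between the rows of `features`."""
    features = np.asarray(features, dtype=np.float64)
    sq_norms = np.einsum("ij,ij->i", features, features)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, which avoids an N x N x D array.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (features @ features.T)
    np.maximum(sq_dist, 0.0, out=sq_dist)    # clip tiny negatives from rounding
    np.fill_diagonal(sq_dist, 0.0)           # a descriptor is at distance 0 from itself
    return np.sqrt(sq_dist)

def rank_database(dist):
    """For each query (row), database indices sorted from closest to furthest."""
    return np.argsort(dist, axis=1, kind="stable")
```

Each row of the ranking starts with the query itself, since its distance to itself is 0. Whether the query is kept in or removed from the result list is left to the evaluation protocol.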

3 Experimental Results and Performance Evaluation

Our method has been evaluated on the SHREC2018 protein dataset and compared to the related state-of-the-art methods [6]. The SHREC2018 protein dataset is composed of 2267 proteins. Each protein is provided in two formats, namely PDB and OFF, which gives a total of 4534 files. As mentioned in the introduction, PDB (Protein Data Bank) is the standard format used by the biology community. This format describes the protein structure in the form of a point cloud where each point is the center of an atom. The OFF (Object File Format) format describes the surface of the protein in the form of a triangle mesh. In this latter case, each atom is approximated by a sphere.

The 2267 proteins have been organized into 107 classes where each class represents a protein domain. The dataset has been built following a specific protocol while considering standard references including the protein structure database PDB [2] as well as the SCOPe database (Structural Classification Of Proteins - extended) [3]. For more details on the protocol followed to build the dataset, we refer the reader to the original paper [6]. Figure 2 illustrates some proteins in the OFF format. Each row shows examples of proteins belonging to the same class.

Fig. 2. On each row, examples of proteins belonging to the same class from the SHREC2018 protein dataset.

To evaluate the performance of our method, we considered the OFF files of the 2267 proteins. For each protein, we applied the processing pipeline described in our methodology to extract the feature vectors. As stated previously, the feature extraction stage relies on transfer learning from the PointNet [8] CNN classification architecture. This allowed us to generate, for each protein, three feature vectors corresponding to three successive intermediate fully connected layers whose sizes are 1024, 512 and 256, respectively.

Figure 3 shows the precision-recall curves obtained by our method for the three feature vectors, using two different distances for the shape matching step: the Euclidean distance and the Earth Mover's distance. For the latter, we only display the best of the three curves (the one based on the vector of size 1024) for clarity’s sake. The figure clearly shows that the best retrieval results are those calculated from the feature vectors of size 1024 using the Euclidean distance.

Fig. 3. Precision-recall curves obtained by our method with different settings.

Moreover, some other standard metrics [12] have been considered in our evaluation:

  • Nearest Neighbor (NN): the percentage of objects belonging to the query class and ranked in the top k of the retrieval result where \(k=1\).

  • First Tier (T1): the same idea as in NN, where k depends on the size of the query class. If the class size is C then \(k=C-1\).

  • Second Tier (T2): in this case \(k=2(C-1)\).

  • E-Measure (EM): a composite measure of the precision and recall calculated on the first 32 retrieved objects.

  • Discounted Cumulative Gain (DCG): assuming that the user pays more attention to the first displayed results of a search, this measure assigns more weight to the relevant results located at the top of the list.
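The first three metrics of the list above can be sketched from a distance matrix and the class labels. This is a minimal sketch that assumes the usual retrieval convention: the query is excluded from its own result list, and T1 and T2 are the fractions of the \(C-1\) other class members found in the top \(C-1\) and top \(2(C-1)\) results. The function name `retrieval_scores` is hypothetical, and the exact protocol of [12] may differ in details.

```python
import numpy as np

def retrieval_scores(dist, labels):
    """Mean Nearest Neighbor, First Tier and Second Tier over all queries."""
    dist = np.asarray(dist, dtype=np.float64)
    labels = np.asarray(labels)
    nn, t1, t2 = [], [], []
    for q in range(len(labels)):
        order = np.argsort(dist[q], kind="stable")
        order = order[order != q]                 # the query is not a retrieval result
        relevant = labels[order] == labels[q]     # True where the class matches
        others = int(relevant.sum())              # C - 1 other members of the class
        if others == 0:
            continue                              # singleton class: tiers are undefined
        nn.append(float(relevant[0]))
        t1.append(relevant[:others].sum() / others)
        t2.append(relevant[:2 * others].sum() / others)
    return float(np.mean(nn)), float(np.mean(t1)), float(np.mean(t2))
```

Skipping queries from singleton classes is a choice made for this example only.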

All these metrics range in [0, 1], where 1 indicates the best performance. Using these metrics, we compared our best results (Euclidean distance calculated on 1024-dimensional vectors) with some of the most recent methods that have exploited the SHREC2018 protein dataset. More precisely, we compared our method (GLT4IP) with six methods described in [6]: 3D convolutional framework for protein shape retrieval (3D-FusionNet), Global Spectral Graph Wavelet framework (GSGW), Histograms of Area Projection Transform (HAPT), Protein Shape Retrieval driven by Digital Elevation Models (DEM), Scale-Invariant Wave Kernel Signature (SIWKS) and Wave Kernel Signature (WKS).

Table 1 summarizes the performances obtained by our method and by the six other methods on the SHREC2018 protein dataset. It shows that our method GLT4IP reaches better results than GSGW, DEM and SIWKS. Three other methods outperform GLT4IP, but the latter remains complementary since relatively fast outputs are obtained through the pre-trained CNN. Nevertheless, the performances obtained by all current methods clearly show that characterizing protein shapes is not an easy task, probably because of the high diversity and irregularity of these shapes, which make the current descriptors only partially effective.

Table 1. Performances of our proposed method GLT4IP compared to those of the state of the art methods obtained on the SHREC2018 protein dataset.

4 Conclusion

The paper presents an approach (GLT4IP) for indexing protein structures from their associated 3D point clouds. The protein data is subsampled to fit the input size of a CNN that was pretrained on a database of man-made 3D objects. The subsampling stage is performed while keeping the shape topology. Subsampling the data and transferring knowledge from a pretrained CNN make GLT4IP relatively fast. GLT4IP outperforms half of the state-of-the-art methods involved in the SHREC2018 contest. GLT4IP reveals the potential of a prepared transfer learning-based method for competing with research methods in protein shape retrieval.