DeepEP: a deep learning framework for identifying essential proteins
Essential proteins are crucial for cellular life, and thus identifying them is an important and challenging problem for researchers. Many computational approaches have recently been proposed to address this problem. However, traditional centrality methods cannot fully represent the topological features of biological networks. In addition, identifying essential proteins is an imbalanced learning problem, yet few current shallow machine learning-based methods are designed to handle the imbalanced characteristics.
We develop DeepEP, a deep learning framework that uses the node2vec technique, multi-scale convolutional neural networks and a sampling technique to identify essential proteins. In DeepEP, the node2vec technique is applied to automatically learn topological and semantic features for each protein in the protein-protein interaction (PPI) network. Gene expression profiles are treated as images, and multi-scale convolutional neural networks are applied to extract their patterns. In addition, DeepEP uses a sampling method to alleviate the imbalanced characteristics. The sampling method draws the same number of majority and minority samples in each training epoch, so that training is not biased toward either class. The experimental results show that DeepEP outperforms traditional centrality methods. Moreover, DeepEP is better than shallow machine learning-based methods. Detailed analyses show that the dense vectors generated by the node2vec technique contribute substantially to the improved performance, which indicates that the node2vec technique effectively captures the topological and semantic properties of the PPI network. The sampling method also improves the performance of identifying essential proteins.
We demonstrate that DeepEP improves the prediction performance by integrating multiple deep learning techniques and a sampling method. DeepEP is more effective than existing methods.
Keywords: Deep learning; Identifying essential proteins; node2vec; Imbalanced learning; Protein-protein interaction network; Multi-scale convolutional neural networks
AUC: Area Under receiver operating characteristic Curve
CNN: Convolutional neural network
ROC: Receiver Operating Characteristic
SVM: Support vector machine
Essential proteins are indispensable for organisms and play a very important role in maintaining cellular life [1, 2]. Determining essential proteins not only helps us understand the basic requirements of a cell at the molecular level, but also helps identify essential genes and find potential drug targets. Identifying essential proteins is therefore very important for researchers. Several biological experimental methods can identify essential proteins, such as RNA interference, conditional knockout, and single gene knockout. However, these methods require substantial resources and time, and they are not always applicable in some complex organisms. Given these experimental constraints, it is appealing to develop an accurate and effective computational approach for identifying essential proteins.
Existing computational approaches can be roughly divided into two categories: centrality methods and shallow machine learning-based methods. Jeong et al. proposed the centrality-lethality rule, which states that highly connected proteins in a PPI network tend to be essential. Based on this rule, many centrality methods have been proposed [7, 8, 9, 10, 11, 12]. Meanwhile, researchers began to integrate additional biological information to identify essential proteins. Many types of biological information, such as gene expression profiles [13, 14], subcellular localization information [15, 16], protein domains, orthologous information [18, 19], GO annotation and RNA-Seq data, have been used in various studies.
With the rapid development of high-throughput sequencing techniques, large amounts of biological data are readily available, which provides a solid foundation for machine learning methods. Generally, researchers develop a machine learning method for prediction in the following steps: select useful features (in this case, topological features of a PPI network), construct training and testing datasets, select an appropriate machine learning algorithm, and evaluate the performance of the algorithm. A number of shallow machine learning-based methods, including support vector machine (SVM), ensemble learning-based models, Naïve Bayes, decision tree and genetic algorithm, are widely used in the identification of essential proteins.
Both centrality methods and shallow machine learning-based methods perform well, but each has limitations. Current centrality methods predict essential proteins by using a function, designed from prior domain knowledge, to characterize the topological features of PPI networks. However, when the PPI network is very complicated (for example, thousands of proteins and tens of thousands of protein-protein interactions), such a function cannot characterize its topological features because the output of the function is just a scalar [27, 28]. For shallow machine learning-based methods, the first step is feature selection. Features are usually selected manually, which makes it hard to justify theoretically why particular topological features are chosen and depends heavily on the prior knowledge of the researchers. In addition, identifying essential proteins is an imbalanced learning problem because the number of non-essential proteins is much larger than the number of essential proteins. Data imbalance usually hinders the performance of machine learning methods, but few current shallow machine learning-based methods are designed to handle imbalanced learning in essential protein prediction.
To tackle the above limitations and further improve machine learning methods for identifying essential proteins, we propose DeepEP, a deep learning framework for identifying essential proteins. Recently, deep learning methods have been applied to represent network information and learn network topological features, and they achieve state-of-the-art performance in many applications [29, 30]. Inspired by their success, we investigate whether deep learning methods can also achieve notable improvements in identifying essential proteins. We believe that deep learning techniques can be used to obtain better representations and thus improve performance. In particular, we employ the node2vec technique to encode a PPI network into a low-dimensional space and learn a low-dimensional dense vector for each protein in the PPI network. The low-dimensional dense vector represents the topological features of the corresponding protein. Using the node2vec technique has two advantages: (i) it provides a vector representation for each protein, which represents the topological features of a PPI network more richly than a scalar; (ii) it learns vector representations from a PPI network automatically and thus does not require choosing topological features by hand. In addition, we use a sampling method to alleviate the imbalanced learning problem. The sampling method draws the same number of negative samples (non-essential proteins) and positive samples (essential proteins) in each training epoch, which ensures that training is not biased toward either class. Because this strategy is repeated over many training epochs, all non-essential proteins are used to train DeepEP with high probability. In addition to overcoming the above limitations, DeepEP also uses other deep learning techniques to improve prediction performance. In this study, we use a PPI network dataset and gene expression profiles for training.
For gene expression profiles, we transform them into images so that deep learning techniques can better extract their patterns. The multi-scale convolutional neural network (CNN) is a recently developed deep learning architecture that is powerful for pattern extraction. We use it to extract more effective patterns from gene expression profiles.
To demonstrate the effectiveness of DeepEP, we perform extensive experiments on an S. cerevisiae dataset. The experimental results show that DeepEP achieves better performance than traditional centrality methods and outperforms shallow machine learning-based methods. To identify the vital element of DeepEP, we compare the results obtained with the node2vec technique against those obtained with six centrality methods. A detailed ablation study shows that the dense vectors generated by the node2vec technique contribute substantially to the improved performance. Additionally, the sampling method also helps to improve the performance of identifying essential proteins.
Materials and methods
Network representation learning
As mentioned in the previous section, researchers need to select useful features when developing a machine learning approach. Selecting PPI topological features is a critical step. Over the past 10 years, researchers have proposed many effective computational methods to predict essential proteins based on network topological features such as DC, BC, CC, EC and so on. However, it is still difficult to choose among these centrality indexes. The traditional feature selection method used in identifying essential proteins is manual feature selection, which has two disadvantages. First, it requires a lot of prior knowledge about essential proteins. Second, each selected topological feature is a scalar, which cannot represent the complex topological features of a PPI network. To address these two issues, we use a network representation learning technique to obtain biological features from a PPI network. Unlike manual feature selection, network representation learning can automatically learn a low-dimensional dense vector for each protein in the biological network to represent its semantic and topological features. A dense vector is a more powerful representation than a scalar and can thus improve performance.
Various network representation learning techniques have been proposed in recent years. Specifically, we use the node2vec technique, which learns dense vector representations of the vertices in a network based on deep learning methods. It uses a biased random walk algorithm to generate a corpus of vertex sequences for training, and it aims to predict the context of a given center node by maximizing the co-occurrence likelihood function. The node2vec technique can explore different types of networks and obtain a richer topological representation of the network than traditional methods.
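The biased random walk that node2vec uses to build its training corpus can be illustrated with a minimal sketch. It assumes an unweighted, undirected network stored as a dictionary of neighbour sets; the function names, the toy network and the parameter values are hypothetical, and the return parameter p and in-out parameter q follow the naming of the node2vec paper. The skip-gram training step that turns the corpus into dense vectors is not shown.

```python
import random


def biased_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """One node2vec-style second-order random walk on an unweighted graph.

    adj maps each node to the set of its neighbours.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break  # isolated node: the walk cannot continue
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))  # the first step is unbiased
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:          # going back to the previous node
                weights.append(1.0 / p)
            elif nxt in adj[prev]:   # staying within one hop of the previous node
                weights.append(1.0)
            else:                    # moving two hops away from the previous node
                weights.append(1.0 / q)
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


def build_corpus(adj, walks_per_node, length, p=1.0, q=1.0, seed=0):
    """Generate walks_per_node walks starting from every node."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    corpus = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)
        for node in nodes:
            corpus.append(biased_walk(adj, node, length, p, q, rng))
    return corpus


# Toy PPI network with hypothetical protein names.
toy_ppi = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
corpus = build_corpus(toy_ppi, walks_per_node=2, length=5, p=1.0, q=0.5)
```

Each walk in `corpus` plays the role of a sentence: nodes that co-occur within a window of the same walk are treated as context for one another when the vectors are learned.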
Data imbalance is very common in real-world data and must be taken into account in machine learning. The imbalance problem arises in the prediction of essential proteins. The classes with more data instances are defined as the majority class, while those with fewer instances are the minority class. In the essential proteins dataset we use, essential proteins belong to the minority class and non-essential proteins belong to the majority class. The imbalanced nature of the data poses a challenge for identifying essential proteins. Most traditional machine learning methods are biased toward the majority class, which leads to a loss of predictive performance for the minority class. Here our focus is to identify the essential proteins among many non-essential ones, which requires us to tackle the problem of data imbalance effectively.
Previous studies have made great efforts to alleviate the imbalanced data learning problem. Sampling methods are the most widely used and are very effective [34, 35, 36]. However, we cannot directly use traditional sampling methods (random oversampling and SMOTE) in DeepEP because of their high consumption of computing resources. The vector fed to the classification module is high-dimensional, and we do not want to synthesize new training samples from the raw high-dimensional vector.
In this study, we set α = 0.001; the number of training epochs k can then be determined by Eq. (2).
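The per-epoch balanced sampling described in the Background can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes that every epoch uses all essential proteins together with an equally sized uniform random draw of non-essential proteins, and the function names are hypothetical. The helper `epochs_for_coverage` shows one plausible way to relate α and k, namely requiring that the chance a given non-essential protein is never drawn in k independent epochs, (1 - M/N)^k, is at most α. It is not a reproduction of Eq. (2), which may differ.

```python
import math
import random


def balanced_epoch(essential, non_essential, rng):
    """Build one epoch: all essential proteins (label 1) plus an equally
    sized random draw of non-essential proteins (label 0)."""
    negatives = rng.sample(non_essential, len(essential))
    epoch = [(x, 1) for x in essential] + [(x, 0) for x in negatives]
    rng.shuffle(epoch)
    return epoch


def epochs_for_coverage(m, n, alpha):
    """Smallest k with (1 - m/n)**k <= alpha, where m is the number of
    essential and n the number of non-essential proteins (assumed relation)."""
    if m >= n:
        return 1
    return math.ceil(math.log(alpha) / math.log(1.0 - m / n))


# Hypothetical toy identifiers, not real proteins.
essential = ["E0", "E1", "E2"]
non_essential = ["N%d" % i for i in range(10)]

rng = random.Random(0)
epoch = balanced_epoch(essential, non_essential, rng)
k = epochs_for_coverage(len(essential), len(non_essential), alpha=0.001)
```

Because a fresh set of non-essential proteins is drawn every epoch, no synthetic samples are created and each epoch stays small, while over k epochs nearly all non-essential proteins are seen at least once.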
To better capture the patterns of gene expression profiles, we treat them as images. A gene expression profile covers three successive metabolic cycles, and each cycle has 12 time points. It is natural to regard one gene expression profile as an image with 1 channel * 3 rows * 12 columns, so that techniques from computer vision can be applied to feature extraction for essential protein prediction. Deep learning techniques have been successfully applied in computer vision, and CNN is the most widely used network architecture. A CNN uses convolutional filters to extract local features from raw images, and a multi-scale CNN uses kernels of different sizes to extract local contextual features. Using different kernels yields information at different spatial scales, and combining the information from these scales can help to improve prediction. Figure 1 illustrates how a gene expression profile is treated as an image.
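As a small illustration of this reshaping, the sketch below turns a 36-value profile into a 1 * 3 * 12 nested list. It assumes that the 36 values are ordered chronologically, cycle by cycle; the function name and the placeholder values are hypothetical.

```python
def profile_to_image(profile, cycles=3, points_per_cycle=12):
    """Reshape a flat expression profile into [channel][row][column],
    with one row per metabolic cycle and one column per time point."""
    if len(profile) != cycles * points_per_cycle:
        raise ValueError("profile length does not match cycles * points_per_cycle")
    rows = [
        list(profile[c * points_per_cycle:(c + 1) * points_per_cycle])
        for c in range(cycles)
    ]
    return [rows]  # a single channel


# Placeholder values standing in for 36 expression measurements.
profile = [float(t) for t in range(36)]
image = profile_to_image(profile)
```

With this layout, the same time point in successive cycles sits in the same column, so a kernel taller than one row can compare a phase of the cycle across cycles.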
AUC is defined as the area under the Receiver Operating Characteristic (ROC) curve, and the ROC curve is a commonly used tool for visualizing the performance of a classifier. The AP score is defined as the area under the precision-recall (PR) curve, and this metric is widely used for evaluating the identification of essential proteins. Note that F-measure, AUC, and AP score are more informative than accuracy, precision and recall in an imbalanced learning problem because they offer a more comprehensive assessment of a machine learning classifier.
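The two threshold-free metrics can be computed directly from labels and predicted scores, as in the following minimal sketch. It uses the rank interpretation of AUC (the probability that a random positive scores above a random negative) and a step-wise sum for the area under the PR curve. The function names and toy data are hypothetical, and the AP routine assumes there are no tied scores.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    tied pairs count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve: the mean of the
    precision values measured at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Toy example: 2 essential (1) and 4 non-essential (0) proteins.
labels = [1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
auc = roc_auc(labels, scores)           # 7 of 8 pairs ordered correctly
ap = average_precision(labels, scores)  # (1/1 + 2/3) / 2
```

The pairwise AUC loop is quadratic in the number of samples, which is acceptable for a sketch but not for large datasets.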
We use three kinds of biological datasets in our experiments: a PPI network dataset, an essential proteins dataset, and gene expression profiles. The PPI network dataset is collected from the BioGRID database. To reduce noise in the dataset, we removed self-interactions and repeated interactions. The preprocessed PPI network dataset contains 5616 proteins and 52,833 protein-protein interactions. The essential proteins dataset is collected from four databases: MIPS, SGD, DEG, and SGDP. We removed overlapping proteins and integrated the information from the four databases. The preprocessed dataset of essential proteins contains 1199 essential proteins. The gene expression profiles dataset is collected from the GEO database (accession number: GSE3431). It consists of 6776 gene products (proteins) and 36 samples, covering three successive metabolic cycles with 12 time points each.
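The interaction clean-up step can be sketched as below. The sketch assumes that interactions are undirected, so (A, B) and (B, A) count as the same interaction; the function name and the protein identifiers are hypothetical.

```python
def clean_interactions(pairs):
    """Drop self-interactions and repeated interactions from an
    undirected edge list, keeping the first occurrence of each pair."""
    seen = set()
    cleaned = []
    for a, b in pairs:
        if a == b:
            continue  # self-interaction
        key = (a, b) if a <= b else (b, a)  # orientation-independent key
        if key in seen:
            continue  # repeated interaction
        seen.add(key)
        cleaned.append(key)
    return cleaned


raw = [("P1", "P2"), ("P2", "P1"), ("P3", "P3"), ("P2", "P3"), ("P1", "P2")]
cleaned = clean_interactions(raw)
proteins = {p for pair in cleaned for p in pair}
```

Counting `proteins` and `cleaned` after this step is how figures such as the number of proteins and interactions in a preprocessed network would be obtained.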
Results and discussion
In our experiments, we first employ the node2vec technique to generate network representation vectors. Each protein in the PPI network is represented by a 64-dimensional vector. Our deep learning framework is implemented in TensorFlow, a widely used deep learning system [43, 44]. Multi-scale CNN layers with kernel sizes 1, 3, and 5 are used to extract contextual features of gene expression profiles. The multi-scale CNN layer yields 3 feature maps, each with 8 channels. These feature maps are concatenated as the extracted contextual feature vector. The output of the multi-scale CNN layer is then fed to a max-pooling layer. After the max-pooling layer, the output vectors and the network representation vectors generated by node2vec are concatenated, and the concatenated vector is fed to a fully connected layer with 312 nodes and the ReLU activation function. To avoid overfitting, a dropout rate of 0.1 is applied to the fully connected layer. Finally, we train our deep learning framework using the Adam optimizer. The batch size is set to 32 and the initial learning rate is set to 0.001.
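The multi-scale convolution step described above can be sketched in plain Python to make the shapes concrete. Several details are assumptions made for illustration only: square k * k kernels, zero padding that keeps the 3 * 12 size, and random weights standing in for learned filters. The pooling, fully connected and training steps are not shown, and this is not the TensorFlow implementation used in the study.

```python
import random


def conv2d_same(image, kernel):
    """Single-channel 2-D cross-correlation with zero padding, so the
    output has the same number of rows and columns as the input."""
    rows, cols = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += image[rr][cc] * kernel[i][j]
            out[r][c] = acc
    return out


def multi_scale_features(image, kernel_sizes=(1, 3, 5), channels=8, seed=0):
    """Apply `channels` filters per kernel size and concatenate all
    resulting maps along the channel axis."""
    rng = random.Random(seed)
    maps = []
    for k in kernel_sizes:
        for _ in range(channels):
            # Random weights are placeholders for trained filters.
            kernel = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(k)]
            maps.append(conv2d_same(image, kernel))
    return maps  # shape: (3 * 8) channels x 3 rows x 12 columns


# Placeholder 3 x 12 expression image (one row per metabolic cycle).
image = [[float(r * 12 + c) for c in range(12)] for r in range(3)]
features = multi_scale_features(image)
identity = conv2d_same(image, [[1.0]])
```

Keeping the spatial size identical across kernel sizes is what allows the three groups of 8 channels to be stacked into one 24-channel block before pooling.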
Comparison with other centrality methods
Comparison with shallow machine learning-based methods
Machine learning-based methods are widely used in predicting essential proteins. SVM and ensemble learning-based models are the two most commonly used shallow machine learning-based methods; decision tree and Naïve Bayes are also very popular. We therefore compare these shallow machine learning methods (SVM, ensemble learning-based model, decision tree, Naïve Bayes) with DeepEP. All of them are implemented with the scikit-learn Python library using default parameters. We shuffle all samples in the raw dataset and then split it into a training dataset and a testing dataset. The training dataset consists of 80% of the samples and the remaining samples constitute the testing dataset. The ratio of positive samples (essential proteins) to negative samples (non-essential proteins) is the same in the training and testing datasets. We compare the machine learning-based methods in two ways. First, we train directly on the raw training dataset and test on the testing dataset. Second, we apply random undersampling to draw M samples (M being the number of essential protein samples) from the non-essential proteins of the training dataset, and then combine the selected non-essential proteins with all essential proteins as input data to train the machine learning models. The overall performance of all machine learning and deep learning algorithms is evaluated on the testing dataset. To ensure a fair comparison, the input features are the same.
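The data preparation for this comparison (a shuffled 80/20 split that preserves the class ratio, followed by optional 1:1 random undersampling of the training set) can be sketched with the standard library alone. The function names and the toy label vector are hypothetical; the study itself uses scikit-learn, whose utilities are not reproduced here.

```python
import random


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx) so that each class contributes the
    same fraction of its samples to the training set."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    return train_idx, test_idx


def undersample(train_idx, labels, seed=0):
    """Keep all positive training samples and draw the same number (M)
    of negative training samples at random."""
    rng = random.Random(seed)
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    keep = pos + rng.sample(neg, len(pos))
    rng.shuffle(keep)
    return keep


# Toy labels: 10 essential (1) and 40 non-essential (0) proteins.
labels = [1] * 10 + [0] * 40
train_idx, test_idx = stratified_split(labels)
balanced = undersample(train_idx, labels)
```

Only the training indices are undersampled; the testing set keeps the original class ratio so that all methods are evaluated on the same imbalanced data.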
Table: Performance of DeepEP and other shallow machine learning-based methods with different ratios. Column header: Machine learning algorithms. Rows: SVM (raw dataset); Decision tree (raw dataset); Decision tree (1:1); Random forest (raw dataset); Random forest (1:1); Adaboost (raw dataset); Naïve Bayes (raw dataset); Naïve Bayes (1:1).
Table: Performance of DeepEP and compared models (using gene expression profiles combined with different centrality indexes (DC, CC, EC, BC, NC, and LAC)). Rows: Gene expression + DC; Gene expression + CC; Gene expression + EC; Gene expression + BC; Gene expression + NC; Gene expression + LAC; Gene expression + node2vec.
Table: Performance of DeepEP and compared methods (models with different ratios (1:1, 1:1.5, 1:2, 1:2.5 and 1:3) and a model which uses the raw dataset for training). Column header: Ratios (essential vs. non-essential).
We propose a new deep learning framework, DeepEP, for identifying essential proteins. DeepEP investigates whether deep learning and sampling methods can achieve notable improvements in identifying essential proteins. The topological features of PPI networks are difficult to capture with traditional methods. DeepEP uses the node2vec technique to automatically learn complex topological features from the PPI network. node2vec projects the PPI network into a low-dimensional space and represents proteins as low-dimensional vectors, which allows DeepEP to address the limitations of traditional methods. In addition, essential protein prediction is an imbalanced learning problem, and a sampling method is applied in DeepEP to handle this issue. The experimental results show that DeepEP achieves state-of-the-art performance, higher than that of centrality methods and shallow machine learning-based methods. To understand why DeepEP works well for identifying essential proteins, we conduct studies that substitute the node2vec technique with six commonly used centrality indexes and vary the ratio used by the proposed sampling method. The experimental results show that the dense vectors generated by the node2vec technique contribute substantially to the improved performance. In addition, the sampling method also helps to improve the performance of the deep learning framework.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 16, 2019: Selected articles from the IEEE BIBM International Conference on Bioinformatics & Biomedicine (BIBM) 2018: bioinformatics and systems biology. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-16.
MZ and ML conceived and designed the experiments. MZ performed the experiments. MZ and ML drafted the manuscript. FXW, YL, and YP revised the manuscript. All authors approved the final manuscript.
This work was supported in part by the National Natural Science Foundation of China under Grants (No. 61832019, No. 61622213 and No. 61728211), Hunan Provincial Science and Technology Program (No. 2018WK4001), the Fundamental Research Funds for the Central Universities of Central South University (No. 502221903), and the 111 Project (No.B18059 and No.G20190018001). The publication costs of this article were funded by the National Natural Science Foundation of China under Grant No. 61832019.
The authors declare that they have no competing interests.
- 8. Joy MP, Brock A, Ingber DE, Huang S. High-betweenness proteins in the yeast protein interaction network. Biomed Res Int. 2005;2005(2):96–103.
- 12. Li G, Li M, Wang J, Li Y, Pan Y. United neighborhood closeness centrality and orthology for predicting essential proteins. IEEE/ACM Trans Comput Biol Bioinform. 2018. https://doi.org/10.1109/TCBB.2018.2889978.
- 15. Zhang J, Li W, Zeng M, Meng X, Kurgan L, Wu F, Li M. NetEPD: a network-based essential protein discovery platform. Tsinghua Sci Technol. 2019. https://doi.org/10.26599/TST.2019.9010056.
- 16. Zeng M, Li M, Fei Z, Wu F, Li Y, Pan Y, Wang J. A deep learning framework for identifying essential proteins by integrating multiple types of biological information. IEEE/ACM Trans Comput Biol Bioinform. 2019. https://doi.org/10.1109/TCBB.2019.2897679.
- 21. Li X, Li W, Zeng M, Zheng R, Li M. Network-based methods for predicting essential genes or proteins: a survey. Brief Bioinform. 2019. https://doi.org/10.1093/bib/bbz017.
- 27. Li M, Gao H, Wang J, Wu F. Control principles for complex biological networks. Brief Bioinform. 2018. https://doi.org/10.1093/bib/bby088.
- 31. Tu C, Zhang W, Liu Z, Sun M. Max-margin DeepWalk: discriminative learning of network representation. In: IJCAI; 2016. p. 3889–95.
- 32. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems; 2013. p. 3111–9.
- 33. Grover A, Leskovec J. node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. New York: ACM; 2016. p. 855–64. https://doi.org/10.1145/2939672.2939754.
- 34. He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009;21(9):1263–84.
- 35. Zeng M, Zou B, Wei F, Liu X, Wang L. Effective prediction of three common diseases by combining SMOTE with Tomek links technique for imbalanced medical data. In: 2016 IEEE International Conference of Online Analysis and Computing Science (ICOACS). Chongqing: IEEE; 2016. p. 225–8. https://doi.org/10.1109/ICOACS.2016.7563084.
- 37. Zeng M, Zhang F, Wu F, Li Y, Wang J, Li M. Protein-protein interaction site prediction through combining local and global features with deep neural networks. Bioinformatics. https://doi.org/10.1093/bioinformatics/btz699.
- 43. Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M. Tensorflow: a system for large-scale machine learning. In: OSDI; 2016. p. 265–83.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.