Recent developments of sequence-based prediction of protein–protein interactions

Murakami, Yoichi; Mizuguchi, Kenji

doi:10.1007/s12551-022-01038-1

Recent developments of sequence-based prediction of protein–protein interactions

Review
Published: 24 December 2022

Volume 14, pages 1393–1411, (2022)
Cite this article

Download PDF

Biophysical Reviews Aims and scope Submit manuscript

Recent developments of sequence-based prediction of protein–protein interactions

Download PDF

Yoichi Murakami¹ &
Kenji Mizuguchi^2,3

4085 Accesses
3 Citations
1 Altmetric
Explore all metrics

Abstract

The identification of protein–protein interactions (PPIs) can lead to a better understanding of cellular functions and biological processes of proteins and contribute to the design of drugs to target disease-causing PPIs. In addition, targeting host–pathogen PPIs is useful for elucidating infection mechanisms. Although several experimental methods have been used to identify PPIs, these methods can yet to draw complete PPI networks. Hence, computational techniques are increasingly required for the prediction of potential PPIs, which have never been seen experimentally. Recent high-performance sequence-based methods have contributed to the construction of PPI networks and the elucidation of pathogenetic mechanisms in specific diseases. However, the usefulness of these methods depends on the quality and quantity of training data of PPIs. In this brief review, we introduce currently available PPI databases and recent sequence-based methods for predicting PPIs. Also, we discuss key issues in this field and present future perspectives of the sequence-based PPI predictions.

An Overview of Scoring Functions Used for Protein–Ligand Interactions in Molecular Docking

Article 15 March 2019

A review of machine learning-based methods for predicting drug–target interactions

Article 12 April 2024

Advances in Structural Bioinformatics

Introduction

Proteins are biological macromolecules composed of one or more chains of amino acid residues and play many critical roles in living cells, participating in a variety of biological processes, such as catalyzing chemical reactions, synthesizing and repairing DNA, and receiving and sending chemical signals. In order to perform their biological functions, they interact with other molecules such as ions, membrane lipids, DNA, and proteins by making direct physical contact through their specific residues. Interactions between proteins, i.e., protein–protein interactions (PPIs), are essential to the formation of macromolecular structures and to almost every biological process (Braun and Gingras 2012; Caterino et al. 2017; Dos Santos Vasconcelos et al. 2018; Liu et al. 2019). Those interactions are made through non-covalent contacts, electrostatic forces, or hydrophobic effects, between specific residues on proteins (De Las and Fontanillo 2010).

The identification of essential proteins required for the survival and development of the cell is important in understanding cell life and will help us better understand diseases and develop new drugs. Also, since almost every biological process involves one or more PPIs, the identification of PPIs in an organism is useful for understanding the molecular mechanisms underlying specific biological processes and for elucidating biological functions of proteins. In addition, comprehensive PPI networks associated with normal and abnormal physiological conditions are necessary not only for understanding physiological mechanisms but also for drug discovery for specific disorders, such as neurological disorders including Alzheimer disease and Creutzfeldt-Jacob disease (Qi et al. 2006; von Mering et al. 2002; Pedamallu and Posfai 2010). Furthermore, the identification of interspecies PPIs, such as virus-host PPIs, is also useful for understating infection mechanisms and for the design of new antiviral drugs and the treatment of infected patients. For example, studies of PPI networks of SARS-CoV-2 and (H1N1) influenza, which have similar clinical symptoms, have shown that virus-human PPIs are involved in multiple heterogeneous processes, including protein trafficking, translation, transcription, and ubiquitination (Khojasteh et al. 2022). These studies can help reveal similarities and difference between the two viruses.

There are two types of experimental methods for identifying PPIs: large-scale and high-throughput experiment methods and target-specific methods. The former screens large-scale PPIs by expressing each protein and exhaustively probes interactions between proteins of interest, such as yeast two-hybrid system, tandem affinity purification mass spectrometry, protein chip technology, and phage display. The latter determines a complex structure of a specific PPI of interest, such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy. The latter can determine interactions at the atomic level.

Although many experimental methods have been developed to identify PPIs, our knowledge of the whole set of PPIs in a particular cell or organism, i.e., interactome, is still incomplete, as numerous factors affect the detection of potential PPIs by those experiments; key factors include transient interactions, post-translational modifications (PTMs) (Seet et al. 2006; Duan and Walther 2015), intrinsically disordered regions (Acuner Ozbabacan et al. 2011; Lua et al. 2014; Meszaros et al. 2009; Babu et al. 2012), and other physiological conditions. Moreover, experimental methods are costly, time-consuming, and labor-intensive. Hence, the development of reliable computational methods for predicting PPIs is required. These computational methods can confirm or supplement experimentally detected PPIs, and add extra information, allowing us, for example, to prioritize therapeutically relevant PPIs as a target or an off-target.

There are mainly two types of computational methods for predicting PPIs: data-driven methods and molecular docking. The former predicts PPIs based on various features of protein pairs, such as interlog (Yu et al. 2004), the protein sequences (Huang et al. 2016; Eid et al. 2016), physicochemical properties (Romero-Molina et al. 2019), evolutionary profiles (Hamp and Rost 2015), and structural information (Zhang et al. 2012), with statistical models or machine learning (ML), which discover relations among the training data of known PPIs. Various ML algorithms have been used in this field, such as random forest (RF), support vector machine (SVM), and ensemble classifiers. In recent publications, deep learning (DL), which is a subset of ML methods based on artificial neural networks, has been recognized as a powerful technique through benchmarking on blind data sets. The second type of methods, molecular docking, searches for the potential binding mode of proteins with surface complementarity and interaction energies (Pierce et al. 2014; Pierce and Weng 2007; Ohue et al. 2014). AlphaFold-Multimer is an extension of AlphaFold 2.0, specifically built for predicting protein complexes with high accuracy (Evans et al. 2022). AlphaFold 2.0 is the first computational method capable of predicting monomeric protein structures with near-experimental accuracy (Jumper et al. 2021). These methods may be considered reliable tools for the prediction of protein structures or protein complexes and will be implemented for the prediction of multimeric protein complexes (Al-Janabi 2022). AlphaPulldown is a Python package for screening PPIs and high-throughput modeling of higher-order oligomers using AlphaFold-Multimer (Yu et al. 2022). However, AlphaPulldown requires structural templates for each query protein, and we still need to know which protein pairs will interact without depending on the presence or absence of protein structures or structural templates. Furthermore, proteins are dynamic and can change their conformations, for example, under different pH conditions (Warwicker 2022). Therefore, the data-driven methods for predicting PPIs, especially sequence-based PPI predictions, which allow exhaustive PPI searches, will continue to be important. In this brief review, we will focus on this type of methods. Below, we will introduce currently available PPI databases and recent sequence-based methods for predicting PPIs. Also, we will discuss key issues in this field and present future perspectives of the sequence-based PPI predictions.

PPI databases

The development of reliable methods for predicting PPIs requires a diverse and representative dataset of known interacting protein pairs. PPIs determined experimentally have been registered as computer-readable data in a database to use in biochemical studies. There are mainly two types of PPI databases: primary databases and secondary databases.

The primary databases collect experimentally derived data which are submitted directly from researchers or collected from peer-reviewed publications. For example, DIP (Database of Interacting Proteins) stores experimentally determined PPIs, both manually curated by expert curators and automatically curated using computational approaches (Salwinski et al. 2004). IntAct (Orchard et al. 2014) and BioGRID (Oughtred et al. 2021) are open-source and comprehensive PPI databases and provide analysis tools for molecular interaction data. IntAct also provides high-quality negative PPIs and several disease-specific datasets, such as interactions associated with Alzheimer’s disease and interactions investigated in the context of cancer or coronavirus. These data are useful and unique among other PPI databases. BioGRID also includes chemical interactions between genes/proteins and bioactive small molecules, and post-translational modifications (Oughtred et al. 2019, 2021).

On the other hand, the secondary databases comprise PPI data derived from numerous primary or other secondary databases using rigorous computational approaches. For example, HitPredict is a database of experimentally determined PPIs from IntAct, BioGrid, MINT (Chatr-aryamontri et al. 2007), DIP, MatrixDB (Clerc et al. 2019), and InnateDB (Breuer et al. 2013), with confidence scores assigned (Lopez et al. 2015). Those scores are calculated based on the experimental details of each interaction and the sequence, structure, and functional annotations of the interacting proteins. PINA (Protein Interaction Network Analysis) platform is a database that integrates PPIs from IntAct, BioGrid, MINT, DIP, and HPRD and provides a variety of web tools to construct, filter, and analyze the networks of proteins of interest (Du et al. 2021). APID (Agile Protein Interaction Data Analyzer) (Alonso-Lopez et al. 2019) provides a collection of known experimentally validated PPIs for more than 400 organisms from DIP (Salwinski et al. 2004), IntAct, MINT, HPRD (Keshava Prasad et al. 2009), BioGRID, BioPlex (Huttlin et al. 2015), and also from experimentally resolved 3D structures, PDB (ww 2019) and PDBsum (Laskowski et al. 2018), indicating different quality levels, i.e., whether interactions are proven by at least one binary detection method or not. This database also provides an interactive data visualization web tool that allows the construction of subinteractomes from query lists of proteins and the exploration and analysis of the corresponding networks about PPIs of interest. HIPPE (Human Integrated Protein–Protein Interaction rEference) provides functionally annotated human PPIs integrated from 10 primary databases and manually curated PPI data with the confidence scoring of experimentally measured interactions (Alanis-Lobato et al. 2017). The parameters of this scoring scheme were jointly optimized by human experts and a computer algorithm. STRING provides known PPIs including direct (physical) and indirect (functional) associations and provides predicted PPIs from automated text-mining of the scientific literature, conserved co-expression, and genomic context predictions (Szklarczyk et al. 2021). In this database, each PPI is annotated with various scores computed based on interaction evidence from the organism of interest or systematic transfers of interaction evidence from one organism to another.

In addition, there is a database called Negatome for experimentally supported non-interacting protein pairs (non-PPIs) collected by manual curation of the literature and computational analysis of protein complexes registered in PDB, excluding interactions from IntAct (Blohm et al. 2014). This database is especially important for training PPI prediction algorithms because it is complementary to the negative data generated by other methods such as randomly selecting proteins from different cellular locations.

Furthermore, there is an international collaboration, IMEx (International Molecular Exhange Consortium), between several institutions providing PPI data in order to develop a single set of curation rules for the registration of PPI data derived from experimentally derived data, pre-prints, and peer-reviewed publications and to standardize the data formats of PPI data (Orchard et al. 2012).

The databases and the number of PPIs registered are listed in Table 1.

Table 1 Currently available primary and secondary PPI databases and non-PPI database (as of November 2022)

Full size table

Preparation of PPI datasets

Preparation of a high-quality dataset is crucial for the sequence-based PPI prediction. The experimentally determined PPIs sourced from the primary or secondary databases shown in Table 1 are normally merged into a set of PPIs as positive samples, excluding interactions between similar proteins. On the other hand, the preparation of non-PPIs can be more important than that of PPIs, because the quality and quantity of non-PPIs influence the PPI predictions significantly. Most PPI prediction methods require training with both positive and negative samples. One simple method is to generate negative samples by randomly pairing proteins in the positive samples and ignoring the actual interactions, assuming that randomly paired proteins are unlikely to be positive samples (dissimilarity negative sampling). Methods with more realistic considerations have been proposed. For example, Hamp and Rost (2015) generated negative samples by randomly sampling from all the pairs in each of the four PPI datasets: one training dataset and three testing datasets C1–C3 (C1, test pairs sharing both proteins with the training dataset; C2, test pairs sharing only one protein with the training dataset; and C3, test pairs sharing neither protein with the training dataset). The need to distinguish between these classes C1–C3 was introduced by Park and Marcotte (2012). Sun et al. (2017) generated negative samples by paring proteins found in different subcellular locations, excluding proteins annotated with ambiguous or uncertain subcellular location terms and with two or more locations.

The size of the negative samples and the size balance between positive and negative samples is one issue that should be carefully considered in developing an accurate and reliable method to predict potential PPIs. One common solution is to randomly sample negative samples, keeping a ratio of positive and negative samples. Hamp and Rost (2015) sampled 10 times as many negative samples as positive samples. However, the data imbalance is an issue that needs to be discussed and solved. More details of training datasets, test datasets, and independent test datasets used in recently developed methods and their data sources are shown in Table 2.

Table 2 Available PPI datasets and independent test datasets used in PPI prediction methods developed within the last 3 years (after 2019)

Full size table

Sequence-based prediction of PPIs

The sequence-based prediction of PPI refers to the problems of inferring, given a pair of protein sequences, the likelihood of an interaction between them, i.e., a score that represents their interacting probability. This approach can be applied to the inference of PPI networks by adding new nodes and new edges to the PPI network graph (Murakami et al. 2017; Tripathi et al. 2019). So far, many computational methods to solve this problem have been proposed as complementary to experimental methods. Even though these sequenced-based methods are less accurate than structure-based methods, they are useful in predicting PPIs involving proteins, for which structural information is unknown or which are intrinsically disordered proteins. In addition, primary structures are available for all proteins, and thus, modeling and predicting PPIs using only sequence information has long been of interest. Almost all these methods are data-driven methods and can be categorized into statistical methods, similarity-based methods, and ML-based methods; however, most of the methods used in recent years are based on similarity or ML. Currently available web servers or downloadable programs for PPI prediction are shown in Table 3, along with a brief description of their strengths and weaknesses. In addition, the reported benchmark results of PPI prediction methods developed within the last three years are shown in Table 4; however, a fair performance comparison is difficult due to the different test datasets.

Table 3 Currently available web servers or downloadable programs for PPI prediction, and their strengths and weaknesses

Full size table

Table 4 The reported benchmark results of PPI prediction methods developed within the last 3 years (after 2019)

Full size table

Statistical methods

The statistical methods generally employed the statistical characteristics or the conserved patterns of protein sequences, assuming that functionally important proteins are conserved across organisms, such as the topological similarity between phylogenetic trees of a pair of proteins (Pazos and Valencia 2001), the co-occurrence of a fine number of short polypeptide sequences observed in known interacting protein pairs (Pitre et al. 2006), and the co-evolutionary divergence based on the assumption that protein pairs with similar substitution rates are likely to interact with each other (Hsin Liu et al. 2013). MirrorTree is a currently available server used to detect the coevolution between proteins and predicts their physical interactions (Ochoa and Pazos 2010). The underlying principle behind this method is that the co-evolution between interacting proteins can be reflected from the similarity scores from the distance matrices of the corresponding phylogenetic trees of the interacting proteins (Craig and Liao 2007).

Similarity-based methods

The similarity-based methods basically employed homologous interactions, in which two PPIs are homologous if a pair of interacting proteins is homologous to a pair of other interacting proteins. The homologous interactions basically include, but are not limited to, orthologous interactions (homologous interactions found in different organisms), i.e., interolog (Walhout et al. 2000), and paralogous interactions (homologous interactions in the same organisms). For example, BIPS (Biana Interolog Prediction Server) (Garcia-Garcia et al. 2012) is based on interolog information, assuming that the homologous proteins preserve similar functional behavior and also the same interactions (Matthews et al. 2001; Yu et al. 2004), and predicts interactions between proteins based on PPIs found in several PPI-related databases integrated using the BIONA (Biologic Interactions and Network Analysis) framework (Garcia-Garcia et al. 2010). SPRINT (Li and Ilie 2017) and PIPE4 (Dick et al. 2020) are based on the idea that a pair of query proteins (X₁, X₂) has an interaction if X₁ and X₂ are similar to either of the known interacting protein pair (P₁, P₂); that is, X₁ is similar to P₁ and X₂ is similar to P₂. However, these methods do not always work well in the absence of known interacting protein pairs with high-sequence similarity to the query protein pairs.

ML-based methods

The ML-based methods employ various supervised ML algorithms, such as SVM, RF, and DL. These algorithms are used for most of the existing PPI prediction methods. SVM aims to find a maximum margin hyperplane in an n-dimensional space (n is the number of features) that separates the labelled samples, i.e., maximizing the distance between samples of different classes. SVM requires computing power to train and test high-dimensional features with radial basis function (RDF) kernel that transforms linearly inseparable samples to linearly separable ones (kernel trick). RF is an ensemble learning method involving numerous decision trees (DT) for classification and outputs the class selected by most trees. RF can effectively train a large dataset of PPIs and vectors with many features and can rank the feature importance for accurate prediction. To train an RF model, the optimal value of the number of trees in the forest is usually adjusted, concerning the computational time and the accuracy. DL is an artificial neural network with multiple layers between the input and output layers. DL is considered to achieve better performance than the conventional ML-based methods in the PPI predictions. DL consists of several algorithms, such as Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Graph Convolutional Networks (GCN). These different algorithms have been applied to the PPI predictions and require the different input forms of proteins. DNN requires a one-dimensional vector, while other algorithms input flexible forms, for example, a two-dimensional matrix such as position-specific scoring matrix (PSSM) (Hu et al. 2022). Recently developed DL-based methods are shown in Table 3.

Protein feature encoding

One important issue for the ML-based methods is how to encode protein sequences of variable lengths into fixed-length numeric feature vectors used for the development of the prediction models and the prediction of PPIs. A pair of feature vectors encoding a protein pair is generally inputted to the method either by combining them sequentially or separately. In addition, the extraction of appropriate features from protein sequences is critical for the accurate PPI prediction.

Commonly used protein feature encoding methods includes physicochemical properties of amino acids (AA) (Bock and Gough 2001; Sun et al. 2017), protein sequence profiles (evolutionary profiles) (Liu et al. 2019; Hashemifar et al. 2018; Hamp and Rost 2015), and protein sequence embedding (Alachram et al. 2021).

Various physicochemical properties of AA are used, such as hydropathy index (hydrophobic or hydrophilic properties of an AA side chain), positively or negatively charged AA, uncharged AA, and pKa value (the acid dissociation constant at logarithmic scale which is a quantitative measure of the strength of an acid in solution). These properties are available in the AAindex (https://www.genome.jp/aaindex/) database (Kawashima and Kanehisa 2000; Kawashima et al. 2008).

Protein sequence profiles are a list of preferences for each AA at each position in a given multiple sequence alignment (MSA), i.e., a PSSM, which is derived from MSA in position-specific iterative BLAST (PSI-BLAST) (Altschul et al. 1997). PSSM is informative protein feature based on evolutionary information extracted from MSA, even though an enormous search time to compute it is required by PSI-BLAST. In addition, PSSM can reproduce evolutionary conserved interactions between protein sequences through their evolutionary information. For example, DPPI (Hashemifar et al. 2018) and TransPPI (Yang et al. 2021) employ PSSM which is a \(N\times 20\) matrix M = {M_ij, i = 1…N, j = 1…20}, where N is the length of a given protein sequence and each element M_ij is the score of the j_th AA in the i_th position of the sequence.

Protein sequence embedding captures semantic information on AA residues in entire sequences. The widely used embedding methods, such as Word2Vec (Mikolov et al. 2013a) and Doc2Vec (Le and Mikolov 2014), was originally developed in the field of natural language processing in order to obtain the distributed representation of words and documents. These methods are learned from the contexts of words in each document using a shallow two-layer neural network with continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-Gram). CBOW predicts the target words from the context of documents, while Skip-Gram predicts the target context from entire words. CBOW learns faster while Skip-Gram does more efficient with small amounts of training data and has better representations for infrequent words (Mikolov et al. 2013b). In the biological sequences, a sequence is regarded as a sentence and represented by multiple k consecutive AA (k-mers) used to train the Word2Vec or Doc2Vec models. These methods were recently applied to the prediction of human and virus protein interactions, showing that they learned the protein features well that enable the reliable prediction of huma-virus PPIs (Yang et al. 2020b; Tsukiyama et al. 2021). Furthermore, several new residue representation methods based on Word2Vec have been proposed, such as Res2vec (Yao et al. 2019) and DL2vec (Chen et al. 2021).

Current issues

False positives, overfitting, and underfitting

Prediction models generated are usually evaluated with several performance measures to compare predicted scores with the actual observed ones. To minimize false positives and false negatives, it is necessary to use performance measures that can appropriately evaluate predictive performance, even though it is difficult to distinguish between false positives and novel PPIs.

Overfitting and underfitting issues can occur when generating predictive models. The performance of a predictive model generated by a learning algorithm depends on how well it captures the underlying features of the training dataset. When the algorithm is too simple and insufficient to model the training dataset, it is called underfitting, whereas when the algorithm is too complex and the training dataset are not sufficient to constrain it, it is called overfitting (Sarkar and Saha 2019). These issues can be solved by selecting an appropriate learning algorithm, performing appropriate evaluation with independent blind test datasets, and preparing a training dataset without missing underlying features of PPIs or bias. In terms of training dataset preparation, to solve the overfitting, it is still important to appropriately reduce the redundancy of homologous sequences present in the dataset. CD-HIT (Fu et al. 2012) is generally used for clustering and comparing protein sequences with several options, such as a sequence identity threshold or an alignment coverage, to reduce redundancies in the dataset. The main advantage of this program is that it is very fast and can handle extremely large datasets. Attention should be paid to the appropriate setting of the options used in CD-HIT. Datasets containing proteins with higher sequence similarity (e.g., > 30%) may lead to overfitting and overestimate the prediction performance. To avoid overfitting and develop reliable prediction models, it is expected to use low-sequence identity cut-off (< 30%) widely used in various sequence-based methods, shown in Table 2. On the other hand, to solve the underfitting, sufficient and up-to-date PPI data that reflect the features of target PPIs should be collected from the existing databases, shown in Table 1.

ML-based methods, realistic datasets

In the ML-based prediction methods, it is important to select the optimal ML approach. Currently, various ML-based methods have been developed that differ in terms of protein representation, method complexity, various protein features, and computational cost. DL-based methods have reported higher prediction performance than other methods but are computationally expensive.

Most methods have been developed on a balanced dataset containing equal numbers of interacting protein pairs and non-interacting protein pairs or an imbalanced dataset containing non-interacting pairs several times greater than the number of interacting pairs. However, this data balance, i.e., the ratio of interacting and non-interacting protein pairs, is highly imbalanced in nature and few methods have been developed or evaluated regarding the impact of realistic datasets containing huge numbers of non-interacting pairs. The realistic datasets should be huge, so handling such datasets requires more efficient algorithms for the prediction of PPIs.

Future perspectives and conclusion

Despite advances in techniques for large-scale experimental analysis of protein interactions, our knowledge of the whole set of PPIs in a particular cell is still incomplete, considering various physicochemical factors such as transient dynamics, post-translational modification (PTM), intrinsically disordered regions, and physiological conditions. For example, a comprehensive understanding of liquid–liquid phase separation (LLPS), which has received increasing attention over the past decade, should require the construction of complete PPI networks involving phase-separated proteins and analyze the network properties of different classes of phase-separating proteins in the human interactome. LLPS is an important mechanism that drives the formation of membrane-less organelles fundamentally driven by multivalent interactions between proteins and/or nucleic acids (Li et al. 2012; Mondal et al. 2022), which can occur in proteins between multiple folded domains or are medicated by intrinsically disordered proteins (Chu et al. 2022). For example, a mutation in the Speckle-type POZ protein (SPOP), a tumor-suppressor protein, can lead to the formation of many solid tumors, including prostate, gastric, and colorectal cancers (Wang et al. 2021). A recent study revealed that substrates of SPOP can phase-separate with SPOP to form condensates in vitro and co-localize in liquid nuclear organelles in cells (Bouchard et al. 2018). Therefore, identification of PPI networks requires further improvement to uncover all possible interactions that may exist at the same cellular localization or the proteome level. In addition, it is indispensable to develop computational methods to predict potential PPIs rapidly and accurately from a large number of candidate protein pairs using only sequence information. The development of such methods will enable comprehensive PPI prediction between proteins, leading to the identification for interacting partners of proteins of interest.

In general, many ML-based PPI prediction methods discriminate whether two proteins interact or not, given output scores from those ML models, but these scores do not explain the strength of the interactions of the two proteins. In order to further improve the reliability of PPI prediction and capture PPI network properties more clearly, protein binding affinities, which are typically measured by the equilibrium dissociation constant (K_d), should be considered to assess predicted interactions. However, the determination of the protein binding affinity is generally not applicable on a large scale due to the dissociation rate dependent accuracy of experimental methods, cost and time constraints, and the need for protein complexes (Abbasi et al. 2020). Therefore, accurate computational techniques can play an important role in the protein binding affinity determination. In particular, sequence-based protein binding affinity prediction is challenging but, like sequence-based PPI prediction, is also an important research topic (Yugandhar and Gromiha 2014; Abbasi et al. 2020). In addition, further development of sequence-based interaction site prediction is also important, as detailed interaction site and residue information enables more accurate PPI prediction, more accurate prediction of protein binding affinity, and more accurate analysis of PPI network properties.

Various PPI prediction methods have been proposed and available as a web application or on an Internet hosting service for software development like GitHub (Table 2). The availability of these methods as a public resource is of great benefit to the drug discovery community as well as to further advances in the PPI prediction. ML-based methods, especially, DL-based method, are currently great success in the PPI prediction. Further improvements to these methods include development with more realistic datasets and construction of large size and independent unbiased datasets to evaluate the proposed methods, because the performance of these methods essentially depends on reliable training data with known PPIs determined experimentally. In addition, it will be necessary to develop protein encoding methods that can better capture various protein features including functional, structural, and evolutional information.

As evidenced by the recent increase in the development of sequence-based PPI prediction methods based on heterogeneous datasets (Tables 2 and 3), these methods are readily available and indispensable for solving pressing questions, such as the mechanism of infection, without relying on structural information. In addition, sequence-based PPI prediction alone cannot surpass the reliability of structure-based methods, but by leveraging recently developed protein structure prediction methods such as AlphaFold 2.0, sequence-based PPI prediction overcome the weakness, which will be expected to further improve the reliability of PPI predictions.

References

Abbasi WA, Yaseen A, Hassan FU, Andleeb S, Minhas F (2020) ISLAND: in-silico proteins binding affinity prediction using sequence information. BioData Min 13(1):20. https://doi.org/10.1186/s13040-020-00231-w
Article CAS Google Scholar
AcunerOzbabacan SE, Engin HB, Gursoy A, Keskin O (2011) Transient protein-protein interactions. Protein Eng Des Sel 24(9):635–648. https://doi.org/10.1093/protein/gzr025
Article CAS Google Scholar
Alachram H, Chereda H, Beissbarth T, Wingender E, Stegmaier P (2021) Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks. PLoS ONE 16(10):e0258623. https://doi.org/10.1371/journal.pone.0258623
Article CAS Google Scholar
Alanis-Lobato G, Andrade-Navarro MA, Schaefer MH (2017) HIPPIE v.20: enhancing meaningfulness and reliability of protein-protein interaction networks. Nucleic Acids Res 45(D1):D408–D414. https://doi.org/10.1093/nar/gkw985
Article CAS Google Scholar
Al-Janabi A (2022) Has DeepMind’s AlphaFold solved the protein folding problem? Biotechniques 72(3):73–76. https://doi.org/10.2144/btn-2022-0007
Article CAS Google Scholar
Alley EC, Khimulya G, Biswas S, AlQuraishi M, Church GM (2019) Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 16(12):1315–1322. https://doi.org/10.1038/s41592-019-0598-1
Article CAS Google Scholar
Alonso-Lopez D, Campos-Laborie FJ, Gutierrez MA, Lambourne L, Calderwood MA, Vidal M, De Las Rivas J (2019) APID database: redefining protein-protein interaction experimental evidences and binary interactomes. Database (Oxford) 2019.https://doi.org/10.1093/database/baz005
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389–3402. https://doi.org/10.1093/nar/25.17.3389
Article CAS Google Scholar
Ammari MG, Gresham CR, McCarthy FM, Nanduri B (2016) HPIDB 2.0: a curated database for host-pathogen interactions. Database Oxford 2016:baw103. https://doi.org/10.1093/database/baw103
Article CAS Google Scholar
Babu MM, Kriwacki RW, Pappu RV (2012) Structural biology. Versatility from Protein Disorder. Science 337(6101):1460–1461. https://doi.org/10.1126/science.1228775
Article CAS Google Scholar
Barman RK, Saha S, Das S (2014) Prediction of interactions between viral and host proteins using supervised machine learning methods. PLoS ONE 9(11):e112034. https://doi.org/10.1371/journal.pone.0112034
Article CAS Google Scholar
Bepler T, Berger B (2019) Learning protein sequence embeddings using information from structure. proceedings of ICLR 2019 abs/1902.08661:1–17. https://doi.org/10.48550/arXiv.1902.08661
Blohm P, Frishman G, Smialowski P, Goebels F, Wachinger B, Ruepp A, Frishman D (2014) Negatome 2.0: a database of non-interacting proteins derived by literature mining, manual annotation and protein structure analysis. Nucleic Acids Res 42(Database issue):D396-400. https://doi.org/10.1093/nar/gkt1079
Article CAS Google Scholar
Bock JR, Gough DA (2001) Predicting protein–protein interactions from primary structure. Bioinformatics 17(5):455–460. https://doi.org/10.1093/bioinformatics/17.5.455
Article CAS Google Scholar
Bouchard JJ, Otero JH, Scott DC, Szulc E, Martin EW, Sabri N, Granata D, Marzahn MR, Lindorff-Larsen K, Salvatella X, Schulman BA, Mittag T (2018) Cancer mutations of the tumor suppressor SPOP disrupt the formation of active, phase-separated compartments. Mol Cell 72(1):19-36 e18. https://doi.org/10.1016/j.molcel.2018.08.027
Article CAS Google Scholar
Braun P, Gingras AC (2012) History of protein-protein interactions: from egg-white to complex networks. Proteomics 12(10):1478–1498. https://doi.org/10.1002/pmic.201100563
Article CAS Google Scholar
Breuer K, Foroushani AK, Laird MR, Chen C, Sribnaia A, Lo R, Winsor GL, Hancock RE, Brinkman FS, Lynn DJ (2013) InnateDB: systems biology of innate immunity and beyond—recent updates and continuing curation. Nucleic Acids Res 41(Database issue):D1228-1233. https://doi.org/10.1093/nar/gks1147
Article CAS Google Scholar
Calderone A, Licata L, Cesareni G (2015) VirusMentha: a new resource for virus-host protein interactions. Nucleic Acids Res 43(Database issue):D588-592. https://doi.org/10.1093/nar/gku830
Article CAS Google Scholar
Caterino M, Ruoppolo M, Mandola A, Costanzo M, Orru S, Imperlini E (2017) Protein-protein interaction networks as a new perspective to evaluate distinct functional roles of voltage-dependent anion channel isoforms. Mol Biosyst 13(12):2466–2476. https://doi.org/10.1039/c7mb00434f
Article CAS Google Scholar
Chatr-aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G (2007) MINT: the Molecular INTeraction database. Nucleic Acids Res 35(Database issue):D572-574. https://doi.org/10.1093/nar/gkl950
Article CAS Google Scholar
Chatr-aryamontri A, Ceol A, Peluso D, Nardozza A, Panni S, Sacco F, Tinti M, Smolyar A, Castagnoli L, Vidal M, Cusick ME, Cesareni G (2009) VirusMINT: a viral protein interaction database. Nucleic Acids Res 37(Database issue):D669-673. https://doi.org/10.1093/nar/gkn739
Article CAS Google Scholar
Chen C, Zhang Q, Ma Q, Yu B (2019a) LightGBM-PPI: Predicting protein-protein interactions through LightGBM with multi-information fusion. Chemom Intell Lab Syst 191:54–64. https://doi.org/10.1016/j.chemolab.2019.06.003
Article CAS Google Scholar
Chen M, Ju CJ, Zhou G, Chen X, Zhang T, Chang KW, Zaniolo C, Wang W (2019b) Multifaceted protein-protein interaction prediction based on Siamese residual RCNN. Bioinformatics 35(14):i305–i314. https://doi.org/10.1093/bioinformatics/btz328
Article CAS Google Scholar
Chen J, Althagafi A, Hoehndorf R (2021) Predicting candidate genes from phenotypes, functions and anatomical site of expression. Bioinformatics 37(6):853–860. https://doi.org/10.1093/bioinformatics/btaa879
Article CAS Google Scholar
Chu X, Sun T, Li Q, Xu Y, Zhang Z, Lai L, Pei J (2022) Prediction of liquid-liquid phase separating proteins using machine learning. BMC Bioinformatics 23(1):72. https://doi.org/10.1186/s12859-022-04599-w
Article CAS Google Scholar
Clerc O, Deniaud M, Vallet SD, Naba A, Rivet A, Perez S, Thierry-Mieg N, Ricard-Blum S (2019) MatrixDB: integration of new data with a focus on glycosaminoglycan interactions. Nucleic Acids Res 47(D1):D376–D381. https://doi.org/10.1093/nar/gky1035
Article CAS Google Scholar
Craig RA, Liao L (2007) Phylogenetic tree information aids supervised learning for predicting protein-protein interaction based on distance matrices. BMC Bioinformatics 8:6. https://doi.org/10.1186/1471-2105-8-6
Article CAS Google Scholar
De Las RJ, Fontanillo C (2010) Protein-protein interactions essentials: key concepts to building and analyzing interactome networks. PLoS Comput Biol 6(6):e1000807. https://doi.org/10.1371/journal.pcbi.1000807
Article CAS Google Scholar
Dick K, Samanfar B, Barnes B, Cober ER, Mimee B, Tan LH, Molnar SJ, Biggar KK, Golshani A, Dehne F, Green JR (2020) PIPE4: fast PPI predictor for comprehensive inter- and cross-species interactomes. Sci Rep 10(1):1390. https://doi.org/10.1038/s41598-019-56895-w
Article CAS Google Scholar
Ding Y, Tang J, Guo F (2016) Predicting protein-protein interactions via multivariate mutual information of protein sequences. BMC Bioinformatics 17(1):398. https://doi.org/10.1186/s12859-016-1253-9
Article Google Scholar
Dong TN, Brogden G, Gerold G, Khosla M (2021) A multitask transfer learning framework for the prediction of virus-human protein-protein interactions. BMC Bioinformatics 22(1):572. https://doi.org/10.1186/s12859-021-04484-y
Article CAS Google Scholar
Dos Santos Vasconcelos CR, de Lima CT, Rezende AM (2018) Building protein-protein interaction networks for Leishmania species through protein structural information. BMC Bioinformatics 19(1):85. https://doi.org/10.1186/s12859-018-2105-6
Article CAS Google Scholar
Du X, Sun S, Hu C, Yao Y, Yan Y, Zhang Y (2017) DeepPPI: boosting prediction of protein–protein interactions with deep neural networks. J Chem Inf Model 57(6):1499–1510. https://doi.org/10.1021/acs.jcim.7b00028
Article CAS Google Scholar
Du Y, Cai M, Xing X, Ji J, Yang E, Wu J (2021) PINA 3.0: mining cancer interactome. Nucleic Acids Res 49(D1):D1351–D1357. https://doi.org/10.1093/nar/gkaa1075
Article CAS Google Scholar
Duan G, Walther D (2015) The roles of post-translational modifications in the context of protein interaction networks. PLoS Comput Biol 11(2):e1004049. https://doi.org/10.1371/journal.pcbi.1004049
Article CAS Google Scholar
DurmusTekir S, Cakir T, Ardic E, Sayilirbas AS, Konuk G, Konuk M, Sariyer H, Ugurlu A, Karadeniz I, Ozgur A, Sevilgen FE, Ulgen KO (2013) PHISTO: pathogen-host interaction search tool. Bioinformatics 29(10):1357–1358. https://doi.org/10.1093/bioinformatics/btt137
Article CAS Google Scholar
Eid FE, ElHefnawi M, Heath LS (2016) DeNovo: virus-host sequence-based protein-protein interaction prediction. Bioinformatics 32(8):1144–1150. https://doi.org/10.1093/bioinformatics/btv737
Article CAS Google Scholar
Evans R, O’Neill M, Pritzel A, Antropova N, Senior A, Green T, Žídek A, Bates R, Blackwell S, Yim J, Ronneberger O, Bodenstein S, Zielinski M, Bridgland A, Potapenko A, Cowie A, Tunyasuvunakool K, Jain R, Clancy E, Kohli P, Jumper J, Hassabis D (2022) Protein complex prediction with AlphaFold-Multimer. DeepMind. https://doi.org/10.1101/2021.10.04.463034
Fu L, Niu B, Zhu Z, Wu S, Li W (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28(23):3150–3152. https://doi.org/10.1093/bioinformatics/bts565
Article CAS Google Scholar
Garcia-Garcia J, Guney E, Aragues R, Planas-Iglesias J, Oliva B (2010) Biana: a software framework for compiling biological interactions and analyzing networks. BMC Bioinformatics 11:56. https://doi.org/10.1186/1471-2105-11-56
Article CAS Google Scholar
Garcia-Garcia J, Schleker S, Klein-Seetharaman J, Oliva B (2012) BIPS: BIANA Interolog Prediction Server. A tool for protein-protein interaction inference. Nucleic Acids Res 40(Web Server issue):W147-151. https://doi.org/10.1093/nar/gks553
Article CAS Google Scholar
Gordon DE, Jang GM, Bouhaddou M, Xu J, Obernier K, White KM, O’Meara MJ, Rezelj VV, Guo JZ, Swaney DL, Tummino TA, Huttenhain R, Kaake RM, Richards AL, Tutuncuoglu B, Foussard H, Batra J, Haas K, Modak M, Kim M, Haas P, Polacco BJ, Braberg H, Fabius JM, Eckhardt M, Soucheray M, Bennett MJ, Cakir M, McGregor MJ, Li Q, Meyer B, Roesch F, Vallet T, Mac Kain A, Miorin L, Moreno E, Naing ZZC, Zhou Y, Peng S, Shi Y, Zhang Z, Shen W, Kirby IT, Melnyk JE, Chorba JS, Lou K, Dai SA, Barrio-Hernandez I, Memon D, Hernandez-Armenta C, Lyu J, Mathy CJP, Perica T, Pilla KB, Ganesan SJ, Saltzberg DJ, Rakesh R, Liu X, Rosenthal SB, Calviello L, Venkataramanan S, Liboy-Lugo J, Lin Y, Huang XP, Liu Y, Wankowicz SA, Bohn M, Safari M, Ugur FS, Koh C, Savar NS, Tran QD, Shengjuler D, Fletcher SJ, O’Neal MC, Cai Y, Chang JCJ, Broadhurst DJ, Klippsten S, Sharp PP, Wenzell NA, Kuzuoglu-Ozturk D, Wang HY, Trenker R, Young JM, Cavero DA, Hiatt J, Roth TL, Rathore U, Subramanian A, Noack J, Hubert M, Stroud RM, Frankel AD, Rosenberg OS, Verba KA, Agard DA, Ott M, Emerman M, Jura N, von Zastrow M, Verdin E, Ashworth A, Schwartz O, d’Enfert C, Mukherjee S, Jacobson M, Malik HS, Fujimori DG, Ideker T, Craik CS, Floor SN, Fraser JS, Gross JD, Sali A, Roth BL, Ruggero D, Taunton J, Kortemme T, Beltrao P, Vignuzzi M, Garcia-Sastre A, Shokat KM, Shoichet BK, Krogan NJ (2020) A SARS-CoV-2 protein interaction map reveals targets for drug repurposing. Nature 583(7816):459–468. https://doi.org/10.1038/s41586-020-2286-9
Article CAS Google Scholar
Guirimand T, Delmotte S, Navratil V (2015) VirHostNet 2.0: surfing on the web of virus/host molecular interactions data. Nucleic Acids Res 43(Database issue):D583-587. https://doi.org/10.1093/nar/gku1121
Article CAS Google Scholar
Guo Y, Yu L, Wen Z, Li M (2008) Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucleic Acids Res 36(9):3025–3030. https://doi.org/10.1093/nar/gkn159
Article CAS Google Scholar
Guo Y, Li M, Pu X, Li G, Guang X, Xiong W, Li J (2010) PRED_PPI: a server for predicting protein-protein interactions based on sequence data with probability assignment. BMC Res Notes 3:145. https://doi.org/10.1186/1756-0500-3-145
Article CAS Google Scholar
Hamp T, Rost B (2015) Evolutionary profiles improve protein-protein interaction prediction from sequence. Bioinformatics 31(12):1945–1950. https://doi.org/10.1093/bioinformatics/btv077
Article CAS Google Scholar
Hashemifar S, Neyshabur B, Khan AA, Xu J (2018) Predicting protein-protein interactions through sequence-based deep learning. Bioinformatics 34(17):i802–i810. https://doi.org/10.1093/bioinformatics/bty573
Article CAS Google Scholar
HitPredict version 4 (2015) Comprehensive reliability scoring of physical protein-protein interactions from more than 100 species. Database (Oxford). https://doi.org/10.1093/database/bav117
Hsin Liu C, Li KC, Yuan S (2013) Human protein-protein interaction prediction by a novel sequence-based co-evolution method: co-evolutionary divergence. Bioinformatics 29(1):92–98. https://doi.org/10.1093/bioinformatics/bts620
Article CAS Google Scholar
Hu X, Feng C, Zhou Y, Harrison A, Chen M (2021) DeepTrio: a ternary prediction system for protein-protein interaction using mask multiple parallel convolutional neural networks. Bioinformatics. https://doi.org/10.1093/bioinformatics/btab737
Article Google Scholar
Hu X, Feng C, Ling T, Chen M (2022) Deep learning frameworks for protein-protein interaction prediction. Comput Struct Biotechnol J 20:3223–3233. https://doi.org/10.1016/j.csbj.2022.06.025
Article CAS Google Scholar
Huang YA, You ZH, Gao X, Wong L, Wang L (2015) Using weighted sparse representation model combined with discrete cosine transformation to predict protein-protein interactions from protein sequence. Biomed Res Int 2015:902198. https://doi.org/10.1155/2015/902198
Article CAS Google Scholar
Huang YA, You ZH, Chen X, Chan K, Luo X (2016) Sequence-based prediction of protein-protein interactions using weighted sparse representation model combined with global encoding. BMC Bioinformatics 17(1):184. https://doi.org/10.1186/s12859-016-1035-4
Article CAS Google Scholar
Huttlin EL, Ting L, Bruckner RJ, Gebreab F, Gygi MP, Szpyt J, Tam S, Zarraga G, Colby G, Baltier K, Dong R, Guarani V, Vaites LP, Ordureau A, Rad R, Erickson BK, Wuhr M, Chick J, Zhai B, Kolippakkam D, Mintseris J, Obar RA, Harris T, Artavanis-Tsakonas S, Sowa ME, De Camilli P, Paulo JA, Harper JW, Gygi SP (2015) The BioPlex network: a systematic exploration of the human interactome. Cell 162(2):425–440. https://doi.org/10.1016/j.cell.2015.06.043
Article CAS Google Scholar
Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Zidek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596(7873):583–589. https://doi.org/10.1038/s41586-021-03819-2
Article CAS Google Scholar
Kawashima S, Kanehisa M (2000) AAindex: amino acid index database. Nucleic Acids Res 28(1):374. https://doi.org/10.1093/nar/28.1.374
Article CAS Google Scholar
Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M (2008) AAindex: amino acid index database, progress report 2008. Nucleic Acids Res 36(Database issue):D202-205. https://doi.org/10.1093/nar/gkm998
Article CAS Google Scholar
Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, Balakrishnan L, Marimuthu A, Banerjee S, Somanathan DS, Sebastian A, Rani S, Ray S, Harrys Kishore CJ, Kanth S, Ahmed M, Kashyap MK, Mohmood R, Ramachandra YL, Krishna V, Rahiman BA, Mohan S, Ranganathan P, Ramabadran S, Chaerkady R, Pandey A (2009) Human Protein Reference Database—2009 update. Nucleic Acids Res 37(Database issue):D767-772. https://doi.org/10.1093/nar/gkn892
Article CAS Google Scholar
Khojasteh H, Khanteymoori A, Olyaee MH (2022) Comparing protein-protein interaction networks of SARS-CoV-2 and (H1N1) influenza using topological features. Sci Rep 12(1):5867. https://doi.org/10.1038/s41598-022-08574-6
Article CAS Google Scholar
Laskowski RA, Jablonska J, Pravda L, Varekova RS, Thornton JM (2018) PDBsum: structural summaries of PDB entries. Protein Sci 27(1):129–134. https://doi.org/10.1002/pro.3289
Article CAS Google Scholar
Le QV, Mikolov T (2014) Distributed representations of sentences and documents. Proc 31st Int Conf Mach Learn, PMLR 32(2):1188–1196. https://doi.org/10.48550/arXiv.1405.4053
Google Scholar
Li Y, Ilie L (2017) SPRINT: ultrafast protein-protein interaction prediction of the entire human interactome. BMC Bioinformatics 18(1):485. https://doi.org/10.1186/s12859-017-1871-x
Article CAS Google Scholar
Li P, Banjade S, Cheng HC, Kim S, Chen B, Guo L, Llaguno M, Hollingsworth JV, King DS, Banani SF, Russo PS, Jiang QX, Nixon BT, Rosen MK (2012) Phase transitions in the assembly of multivalent signalling proteins. Nature 483(7389):336–340. https://doi.org/10.1038/nature10879
Article CAS Google Scholar
Li J, Guo M, Tian X, Wang X, Yang X, Wu P, Liu C, Xiao Z, Qu Y, Yin Y, Wang C, Zhang Y, Zhu Z, Liu Z, Peng C, Zhu T, Liang Q (2021) Virus-host interactome and proteomic survey reveal potential virulence factors influencing SARS-CoV-2 pathogenesis. Med (N Y) 2(1):99-112 e117. https://doi.org/10.1016/j.medj.2020.07.002
Article CAS Google Scholar
Li X, Han P, Wang G, Chen W, Wang S, Song T (2022) SDNN-PPI: self-attention with deep neural network effect on protein-protein interaction prediction. BMC Genomics 23(1):474. https://doi.org/10.1186/s12864-022-08687-2
Article CAS Google Scholar
Liu X, Yang Z, Sang S, Lin H, Wang J, Xu B (2019) Detection of protein complexes from multiple protein interaction networks using graph embedding. Artif Intell Med 96:107–115. https://doi.org/10.1016/j.artmed.2019.04.001
Article Google Scholar
Liu-Wei W, Kafkas S, Chen J, Dimonaco NJ, Tegner J, Hoehndorf R (2021) DeepViral: prediction of novel virus-host interactions from protein sequences and infectious disease phenotypes. Bioinformatics. https://doi.org/10.1093/bioinformatics/btab147
Article Google Scholar
Lua RC, Marciano DC, Katsonis P, Adikesavan AK, Wilkins AD, Lichtarge O (2014) Prediction and redesign of protein-protein interactions. Prog Biophys Mol Biol 116(2–3):194–202. https://doi.org/10.1016/j.pbiomolbio.2014.05.004
Article CAS Google Scholar
Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J, Vincent S, Vidal M (2001) Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or “interologs.” Genome Res 11(12):2120–2126. https://doi.org/10.1101/gr.205301
Article CAS Google Scholar
Meszaros B, Simon I, Dosztanyi Z (2009) Prediction of protein binding regions in disordered proteins. PLoS Comput Biol 5(5):e1000376. https://doi.org/10.1371/journal.pcbi.1000376
Article CAS Google Scholar
Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013b) Distributed representations of words and phrases and their compositionality. NIPS’13: Proc 26th Int Conf Neural Inf Process Syst 2:3111–3119
Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient Estimation of Word Representations in Vector Space. Proceedings of Workshop at ICLR arXiv:1301.3781v1. https://doi.org/10.48550/arXiv.1301.3781
Mondal S, Narayan K, Botterbusch S, Powers I, Zheng J, James HP, Jin R, Baumgart T (2022) Multivalent interactions between molecular components involved in fast endophilin mediated endocytosis drive protein phase separation. Nat Commun 13(1):5017. https://doi.org/10.1038/s41467-022-32529-0
Article CAS Google Scholar
Murakami Y, Mizuguchi K (2014) Homology-based prediction of interactions between proteins using Averaged One-Dependence Estimators. BMC Bioinformatics 15:213. https://doi.org/10.1186/1471-2105-15-213
Article CAS Google Scholar
Murakami Y, Tripathi LP, Prathipati P, Mizuguchi K (2017) Network analysis and in silico prediction of protein-protein interactions with applications in drug discovery. Curr Opin Struct Biol 44:134–142. https://doi.org/10.1016/j.sbi.2017.02.005
Article CAS Google Scholar
Ochoa D, Pazos F (2010) Studying the co-evolution of protein families with the Mirrortree web server. Bioinformatics 26(10):1370–1371. https://doi.org/10.1093/bioinformatics/btq137
Article CAS Google Scholar
Ohue M, Matsuzaki Y, Uchikoga N, Ishida T, Akiyama Y (2014) MEGADOCK: an all-to-all protein-protein interaction prediction system using tertiary structure data. Protein Pept Lett 21(8):766–778. https://doi.org/10.2174/09298665113209990050
Article CAS Google Scholar
Orchard S, Kerrien S, Abbani S, Aranda B, Bhate J, Bidwell S, Bridge A, Briganti L, Brinkman FS, Cesareni G, Chatr-aryamontri A, Chautard E, Chen C, Dumousseau M, Goll J, Hancock RE, Hannick LI, Jurisica I, Khadake J, Lynn DJ, Mahadevan U, Perfetto L, Raghunath A, Ricard-Blum S, Roechert B, Salwinski L, Stumpflen V, Tyers M, Uetz P, Xenarios I, Hermjakob H (2012) Protein interaction data curation: the International Molecular Exchange (IMEx) consortium. Nat Methods 9(4):345–350. https://doi.org/10.1038/nmeth.1931
Article CAS Google Scholar
Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, Broackes-Carter F, Campbell NH, Chavali G, Chen C, Del-Toro N, Duesbury M, Dumousseau M, Galeota E, Hinz U, Iannuccelli M, Jagannathan S, Jimenez R, Khadake J, Lagreid A, Licata L, Lovering RC, Meldal B, Melidoni AN, Milagros M, Peluso D, Perfetto L, Porras P, Raghunath A, Ricard-Blum S, Roechert B, Stutz A, Tognolli M, van Roey K, Cesareni G, Hermjakob H (2014) The MIntAct project—IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res 42(Database issue):D358-363. https://doi.org/10.1093/nar/gkt1115
Article CAS Google Scholar
Oughtred R, Stark C, Breitkreutz BJ, Rust J, Boucher L, Chang C, Kolas N, O’Donnell L, Leung G, McAdam R, Zhang F, Dolma S, Willems A, Coulombe-Huntington J, Chatr-Aryamontri A, Dolinski K, Tyers M (2019) The BioGRID interaction database: 2019 update. Nucleic Acids Res 47(D1):D529–D541. https://doi.org/10.1093/nar/gky1079
Article CAS Google Scholar
Oughtred R, Rust J, Chang C, Breitkreutz BJ, Stark C, Willems A, Boucher L, Leung G, Kolas N, Zhang F, Dolma S, Coulombe-Huntington J, Chatr-Aryamontri A, Dolinski K, Tyers M (2021) The BioGRID database: a comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci 30(1):187–200. https://doi.org/10.1002/pro.3978
Article CAS Google Scholar
Pan XY, Zhang YN, Shen HB (2010) Large-scale prediction of human protein-protein interactions from amino acid sequence based on latent topic features. J Proteome Res 9(10):4992–5001. https://doi.org/10.1021/pr100618t
Article CAS Google Scholar
Park Y, Marcotte EM (2012) Flaws in evaluation schemes for pair-input computational predictions. Nat Methods 9(12):1134–1136. https://doi.org/10.1038/nmeth.2259
Article CAS Google Scholar
Pazos F, Valencia A (2001) Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng 14(9):609–614. https://doi.org/10.1093/protein/14.9.609
Article CAS Google Scholar
Pedamallu CS, Posfai J (2010) Open source tool for prediction of genome wide protein-protein interaction network based on ortholog information. Source Code Biol Med 5:8. https://doi.org/10.1186/1751-0473-5-8
Article CAS Google Scholar
Pierce B, Weng Z (2007) ZRANK: reranking protein docking predictions with an optimized energy function. Proteins 67(4):1078–1086. https://doi.org/10.1002/prot.21373
Article CAS Google Scholar
Pierce BG, Wiehe K, Hwang H, Kim BH, Vreven T, Weng Z (2014) ZDOCK server: interactive docking prediction of protein-protein complexes and symmetric multimers. Bioinformatics 30(12):1771–1773. https://doi.org/10.1093/bioinformatics/btu097
Article CAS Google Scholar
Pitre S, Dehne F, Chan A, Cheetham J, Duong A, Emili A, Gebbia M, Greenblatt J, Jessulat M, Krogan N, Luo X, Golshani A (2006) PIPE: a protein-protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs. BMC Bioinformatics 7:365. https://doi.org/10.1186/1471-2105-7-365
Article CAS Google Scholar
Pitre S, Hooshyar M, Schoenrock A, Samanfar B, Jessulat M, Green JR, Dehne F, Golshani A (2012) Short co-occurring polypeptide regions can predict global protein interaction maps. Sci Rep 2:239. https://doi.org/10.1038/srep00239
Article CAS Google Scholar
Qi Y, Bar-Joseph Z, Klein-Seetharaman J (2006) Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins 63(3):490–500. https://doi.org/10.1002/prot.20865
Article CAS Google Scholar
Romero-Molina S, Ruiz-Blanco YB, Harms M, Munch J, Sanchez-Garcia E (2019) PPI-Detect: a support vector machine model for sequence-based prediction of protein-protein interactions. J Comput Chem 40(11):1233–1242. https://doi.org/10.1002/jcc.25780
Article CAS Google Scholar
Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D (2004) The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 32(Database issue):D449-451. https://doi.org/10.1093/nar/gkh086
Article Google Scholar
Sarkar D, Saha S (2019) Machine-learning techniques for the prediction of protein-protein interactions. J Biosci 44:(4). https://doi.org/10.1007/s12038-019-9909-z
Seet BT, Dikic I, Zhou MM, Pawson T (2006) Reading protein modifications with interaction domains. Nat Rev Mol Cell Biol 7(7):473–483. https://doi.org/10.1038/nrm1960
Article CAS Google Scholar
Sledzieski S, Singh R, Cowen L, Berger B (2021) D-SCRIPT translates genome to phenome with sequence-based, structure-aware, genome-scale predictions of protein-protein interactions. Cell Syst 12(10):969-982 e966. https://doi.org/10.1016/j.cels.2021.08.010
Article CAS Google Scholar
Song X-Y, Chen Z-H, Sun X-Y, You Z-H, Li L-P, Zhao Y (2018) An ensemble classifier with random projection for predicting protein–protein interactions using sequence and evolutionary information. Appl Sci 8(1):89. https://doi.org/10.3390/app8010089
Article Google Scholar
Sun T, Zhou B, Lai L, Pei J (2017) Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinformatics 18(1):277. https://doi.org/10.1186/s12859-017-1700-2
Article CAS Google Scholar
Szklarczyk D, Gable AL, Nastou KC, Lyon D, Kirsch R, Pyysalo S, Doncheva NT, Legeay M, Fang T, Bork P, Jensen LJ, von Mering C (2021) The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res 49(D1):D605–D612. https://doi.org/10.1093/nar/gkaa1074
Article CAS Google Scholar
Tripathi LP, Chen Y-A, Mizuguchi K, Murakami Y (2019) Network-based analysis for biological discovery. In: Ranganathan S, Gribskov M, Nakai K, Schönbach C (eds) Encyclopedia of Bioinformatics and Computational Biology. Academic Press, Oxford, pp 283–291. https://doi.org/10.1016/B978-0-12-809633-8.20674-2
Chapter Google Scholar
Tsukiyama S, Hasan MM, Fujii S, Kurata H (2021) LSTM-PHV: prediction of human-virus protein-protein interactions by LSTM with word2vec. Brief Bioinform 22 (6). https://doi.org/10.1093/bib/bbab228
von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417(6887):399–403. https://doi.org/10.1038/nature750
Article CAS Google Scholar
Walhout AJ, Sordella R, Lu X, Hartley JL, Temple GF, Brasch MA, Thierry-Mieg N, Vidal M (2000) Protein interaction mapping in C. elegans using proteins involved in vulval development. Science 287(5450):116–122. https://doi.org/10.1126/science.287.5450.116
Article CAS Google Scholar
Wang YB, You ZH, Li LP, Huang YA, Yi HC (2017) Detection of interactions between proteins by using Legendre moments descriptor to extract discriminatory information embedded in PSSM. Molecules 22(8):1366. https://doi.org/10.3390/molecules22081366
Article CAS Google Scholar
Wang B, Zhang L, Dai T, Qin Z, Lu H, Zhang L, Zhou F (2021) Liquid-liquid phase separation in human health and diseases. Signal Transduct Target Ther 6(1):290. https://doi.org/10.1038/s41392-021-00678-1
Article CAS Google Scholar
Warwicker J (2022) The physical basis for pH sensitivity in biomolecular structure and function, with application to the spike protein of SARS-CoV-2. Front Mol Biosci 9:834011. https://doi.org/10.3389/fmolb.2022.834011
Article CAS Google Scholar
wwPDBc (2019) Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res 47(D1):D520–D528. https://doi.org/10.1093/nar/gky949
Article CAS Google Scholar
Yang F, Fan K, Song D, Lin H (2020a) Graph-based prediction of protein-protein interactions with attributed signed graph embedding. BMC Bioinformatics 21(1):323. https://doi.org/10.1186/s12859-020-03646-8
Article CAS Google Scholar
Yang X, Yang S, Li Q, Wuchty S, Zhang Z (2020b) Prediction of human-virus protein-protein interactions through a sequence embedding-based machine learning method. Comput Struct Biotechnol J 18:153–161. https://doi.org/10.1016/j.csbj.2019.12.005
Article CAS Google Scholar
Yang X, Yang S, Lian X, Wuchty S, Zhang Z (2021) Transfer learning via multi-scale convolutional neural layers for human-virus protein-protein interaction prediction. Bioinformatics. https://doi.org/10.1093/bioinformatics/btab533
Article Google Scholar
Yao Y, Du X, Diao Y, Zhu H (2019) An integration of deep learning with feature embedding for protein-protein interaction prediction. PeerJ 7:e7126. https://doi.org/10.7717/peerj.7126
Article Google Scholar
You ZH, Huang WZ, Zhang S, Huang YA, Yu CQ, Li LP (2019) An efficient ensemble learning approach for predicting protein-protein interactions by integrating protein primary sequence and evolutionary information. IEEE/ACM Trans Comput Biol Bioinf 16(3):809–817. https://doi.org/10.1109/TCBB.2018.2882423
Article Google Scholar
Yu H, Luscombe NM, Lu HX, Zhu X, Xia Y, Han JD, Bertin N, Chung S, Vidal M, Gerstein M (2004) Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res 14(6):1107–1118. https://doi.org/10.1101/gr.1774904
Article CAS Google Scholar
Yu B, Chen C, Wang X, Yu Z, Ma A, Liu B (2021) Prediction of protein–protein interactions based on elastic net and deep forest. Expert Syst Appl 176:114876. https://doi.org/10.1016/j.eswa.2021.114876
Article Google Scholar
Yu D, Chojnowski G, Rosenthal M, Kosinski J (2022) AlphaPulldown-a Python package for protein-protein interaction screens using AlphaFold-Multimer. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac749
Yugandhar K, Gromiha MM (2014) Protein-protein binding affinity prediction from amino acid sequence. Bioinformatics 30(24):3583–3589. https://doi.org/10.1093/bioinformatics/btu580
Article CAS Google Scholar
Zhang QC, Petrey D, Deng L, Qiang L, Shi Y, Thu CA, Bisikirska B, Lefebvre C, Accili D, Hunter T, Maniatis T, Califano A, Honig B (2012) Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature 490(7421):556–560. https://doi.org/10.1038/nature11503
Article CAS Google Scholar
Zhou X, Park B, Choi D, Han K (2018) A generalized approach to predicting protein-protein interactions between virus and host. BMC Genomics 19(Suppl 6):568. https://doi.org/10.1186/s12864-018-4924-2
Article CAS Google Scholar
Zhou YZ, Gao Y, Zheng YY (2011) Prediction of protein-protein interactions using local description of amino acid sequence. Advances in Computer Science and Education Applications, pp 254–262. https://doi.org/10.1007/978-3-642-22456-0_37

Download references

Author information

Authors and Affiliations

Tokyo University of Information Sciences, 4-1 Onaridai, Wakaba-Ku, Chiba, 265-8501, Japan
Yoichi Murakami
Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita-Shi, Osaka, 565-0871, Japan
Kenji Mizuguchi
National Institutes of Biomedical Innovation, Health and Nutrition, 7-6-8 Saito Asagi, Ibaraki, Osaka, 567-0085, Japan
Kenji Mizuguchi

Authors

Yoichi Murakami
View author publications
You can also search for this author in PubMed Google Scholar
Kenji Mizuguchi
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yoichi Murakami.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Murakami, Y., Mizuguchi, K. Recent developments of sequence-based prediction of protein–protein interactions. Biophys Rev 14, 1393–1411 (2022). https://doi.org/10.1007/s12551-022-01038-1

Download citation

Accepted: 08 December 2022
Published: 24 December 2022
Issue Date: December 2022
DOI: https://doi.org/10.1007/s12551-022-01038-1

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Recent developments of sequence-based prediction of protein–protein interactions

Abstract

Similar content being viewed by others

An Overview of Scoring Functions Used for Protein–Ligand Interactions in Molecular Docking

A review of machine learning-based methods for predicting drug–target interactions

Advances in Structural Bioinformatics

Introduction

PPI databases

Preparation of PPI datasets

Sequence-based prediction of PPIs

Statistical methods

Similarity-based methods

ML-based methods

Protein feature encoding

Current issues

False positives, overfitting, and underfitting

ML-based methods, realistic datasets

Future perspectives and conclusion

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Recent developments of sequence-based prediction of protein–protein interactions

Abstract

Similar content being viewed by others

An Overview of Scoring Functions Used for Protein–Ligand Interactions in Molecular Docking

A review of machine learning-based methods for predicting drug–target interactions

Advances in Structural Bioinformatics

Introduction

PPI databases

Preparation of PPI datasets

Sequence-based prediction of PPIs

Statistical methods

Similarity-based methods

ML-based methods

Protein feature encoding

Current issues

False positives, overfitting, and underfitting

ML-based methods, realistic datasets

Future perspectives and conclusion

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation