Introduction

Proteins are biological macromolecules composed of one or more chains of amino acid residues and play many critical roles in living cells, participating in a variety of biological processes, such as catalyzing chemical reactions, synthesizing and repairing DNA, and receiving and sending chemical signals. In order to perform their biological functions, they interact with other molecules such as ions, membrane lipids, DNA, and proteins by making direct physical contact through their specific residues. Interactions between proteins, i.e., protein–protein interactions (PPIs), are essential to the formation of macromolecular structures and to almost every biological process (Braun and Gingras 2012; Caterino et al. 2017; Dos Santos Vasconcelos et al. 2018; Liu et al. 2019). Those interactions are made through non-covalent contacts, electrostatic forces, or hydrophobic effects, between specific residues on proteins (De Las and Fontanillo 2010).

The identification of essential proteins required for the survival and development of the cell is important in understanding cell life and will help us better understand diseases and develop new drugs. Also, since almost every biological process involves one or more PPIs, the identification of PPIs in an organism is useful for understanding the molecular mechanisms underlying specific biological processes and for elucidating biological functions of proteins. In addition, comprehensive PPI networks associated with normal and abnormal physiological conditions are necessary not only for understanding physiological mechanisms but also for drug discovery for specific disorders, such as neurological disorders including Alzheimer’s disease and Creutzfeldt-Jakob disease (Qi et al. 2006; von Mering et al. 2002; Pedamallu and Posfai 2010). Furthermore, the identification of interspecies PPIs, such as virus-host PPIs, is also useful for understanding infection mechanisms and for the design of new antiviral drugs and the treatment of infected patients. For example, studies of PPI networks of SARS-CoV-2 and (H1N1) influenza, which have similar clinical symptoms, have shown that virus-human PPIs are involved in multiple heterogeneous processes, including protein trafficking, translation, transcription, and ubiquitination (Khojasteh et al. 2022). These studies can help reveal similarities and differences between the two viruses.

There are two types of experimental methods for identifying PPIs: large-scale, high-throughput methods and target-specific methods. The former, such as the yeast two-hybrid system, tandem affinity purification mass spectrometry, protein chip technology, and phage display, screen PPIs on a large scale by expressing each protein and exhaustively probing interactions between the proteins of interest. The latter, such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy, determine the structure of a specific protein complex of interest and can therefore resolve interactions at the atomic level.

Although many experimental methods have been developed to identify PPIs, our knowledge of the whole set of PPIs in a particular cell or organism, i.e., interactome, is still incomplete, as numerous factors affect the detection of potential PPIs by those experiments; key factors include transient interactions, post-translational modifications (PTMs) (Seet et al. 2006; Duan and Walther 2015), intrinsically disordered regions (Acuner Ozbabacan et al. 2011; Lua et al. 2014; Meszaros et al. 2009; Babu et al. 2012), and other physiological conditions. Moreover, experimental methods are costly, time-consuming, and labor-intensive. Hence, the development of reliable computational methods for predicting PPIs is required. These computational methods can confirm or supplement experimentally detected PPIs, and add extra information, allowing us, for example, to prioritize therapeutically relevant PPIs as a target or an off-target.

There are mainly two types of computational methods for predicting PPIs: data-driven methods and molecular docking. The former predict PPIs based on various features of protein pairs, such as interologs (Yu et al. 2004), protein sequences (Huang et al. 2016; Eid et al. 2016), physicochemical properties (Romero-Molina et al. 2019), evolutionary profiles (Hamp and Rost 2015), and structural information (Zhang et al. 2012), using statistical models or machine learning (ML), which discover relations among the training data of known PPIs. Various ML algorithms have been used in this field, such as random forest (RF), support vector machine (SVM), and ensemble classifiers. In recent publications, deep learning (DL), which is a subset of ML methods based on artificial neural networks, has been recognized as a powerful technique through benchmarking on blind data sets. The second type of method, molecular docking, searches for the potential binding mode of proteins using surface complementarity and interaction energies (Pierce et al. 2014; Pierce and Weng 2007; Ohue et al. 2014). AlphaFold 2.0 is the first computational method capable of predicting monomeric protein structures with near-experimental accuracy (Jumper et al. 2021). AlphaFold-Multimer is an extension of AlphaFold 2.0, specifically built for predicting protein complexes with high accuracy (Evans et al. 2022). These methods may be considered reliable tools for the prediction of protein structures or protein complexes and will be implemented for the prediction of multimeric protein complexes (Al-Janabi 2022). AlphaPulldown is a Python package for screening PPIs and high-throughput modeling of higher-order oligomers using AlphaFold-Multimer (Yu et al. 2022). However, AlphaPulldown requires structural templates for each query protein, and we still need to know which protein pairs will interact without depending on the presence or absence of protein structures or structural templates.
Furthermore, proteins are dynamic and can change their conformations, for example, under different pH conditions (Warwicker 2022). Therefore, the data-driven methods for predicting PPIs, especially sequence-based PPI predictions, which allow exhaustive PPI searches, will continue to be important. In this brief review, we will focus on this type of methods. Below, we will introduce currently available PPI databases and recent sequence-based methods for predicting PPIs. Also, we will discuss key issues in this field and present future perspectives of the sequence-based PPI predictions.

PPI databases

The development of reliable methods for predicting PPIs requires a diverse and representative dataset of known interacting protein pairs. PPIs determined experimentally have been registered as computer-readable data in a database to use in biochemical studies. There are mainly two types of PPI databases: primary databases and secondary databases.

The primary databases collect experimentally derived data which are submitted directly from researchers or collected from peer-reviewed publications. For example, DIP (Database of Interacting Proteins) stores experimentally determined PPIs, both manually curated by expert curators and automatically curated using computational approaches (Salwinski et al. 2004). IntAct (Orchard et al. 2014) and BioGRID (Oughtred et al. 2021) are open-source and comprehensive PPI databases and provide analysis tools for molecular interaction data. IntAct also provides high-quality negative PPIs and several disease-specific datasets, such as interactions associated with Alzheimer’s disease and interactions investigated in the context of cancer or coronavirus. These data are useful and unique among other PPI databases. BioGRID also includes chemical interactions between genes/proteins and bioactive small molecules, and post-translational modifications (Oughtred et al. 2019, 2021).

On the other hand, the secondary databases comprise PPI data derived from numerous primary or other secondary databases using rigorous computational approaches. For example, HitPredict is a database of experimentally determined PPIs from IntAct, BioGRID, MINT (Chatr-aryamontri et al. 2007), DIP, MatrixDB (Clerc et al. 2019), and InnateDB (Breuer et al. 2013), with confidence scores assigned (Lopez et al. 2015). Those scores are calculated based on the experimental details of each interaction and the sequence, structure, and functional annotations of the interacting proteins. The PINA (Protein Interaction Network Analysis) platform is a database that integrates PPIs from IntAct, BioGRID, MINT, DIP, and HPRD and provides a variety of web tools to construct, filter, and analyze the networks of proteins of interest (Du et al. 2021). APID (Agile Protein Interaction Data Analyzer) (Alonso-Lopez et al. 2019) provides a collection of known experimentally validated PPIs for more than 400 organisms from DIP (Salwinski et al. 2004), IntAct, MINT, HPRD (Keshava Prasad et al. 2009), BioGRID, BioPlex (Huttlin et al. 2015), and also from experimentally resolved 3D structures, PDB (ww 2019) and PDBsum (Laskowski et al. 2018), indicating different quality levels, i.e., whether interactions are proven by at least one binary detection method or not. This database also provides an interactive data visualization web tool that allows the construction of subinteractomes from query lists of proteins and the exploration and analysis of the corresponding networks of PPIs of interest. HIPPIE (Human Integrated Protein–Protein Interaction rEference) provides functionally annotated human PPIs integrated from 10 primary databases and manually curated PPI data with the confidence scoring of experimentally measured interactions (Alanis-Lobato et al. 2017). The parameters of this scoring scheme were jointly optimized by human experts and a computer algorithm.
STRING provides known PPIs including direct (physical) and indirect (functional) associations and provides predicted PPIs from automated text-mining of the scientific literature, conserved co-expression, and genomic context predictions (Szklarczyk et al. 2021). In this database, each PPI is annotated with various scores computed based on interaction evidence from the organism of interest or systematic transfers of interaction evidence from one organism to another.

In addition, there is a database called Negatome for experimentally supported non-interacting protein pairs (non-PPIs) collected by manual curation of the literature and computational analysis of protein complexes registered in PDB, excluding interactions from IntAct (Blohm et al. 2014). This database is especially important for training PPI prediction algorithms because it is complementary to the negative data generated by other methods such as randomly selecting proteins from different cellular locations.

Furthermore, there is an international collaboration, IMEx (International Molecular Exchange Consortium), between several institutions providing PPI data in order to develop a single set of curation rules for the registration of PPI data obtained from experimental data, pre-prints, and peer-reviewed publications and to standardize the data formats of PPI data (Orchard et al. 2012).

The databases and the number of PPIs registered are listed in Table 1.

Table 1 Currently available primary and secondary PPI databases and non-PPI database (as of November 2022)

Preparation of PPI datasets

Preparation of a high-quality dataset is crucial for sequence-based PPI prediction. The experimentally determined PPIs sourced from the primary or secondary databases shown in Table 1 are normally merged into a set of PPIs as positive samples, excluding interactions between similar proteins. On the other hand, the preparation of non-PPIs can be more important than that of PPIs, because the quality and quantity of non-PPIs influence the PPI predictions significantly. Most PPI prediction methods require training with both positive and negative samples. One simple method is to generate negative samples by randomly pairing proteins in the positive samples and ignoring the actual interactions, assuming that randomly paired proteins are unlikely to be positive samples (dissimilarity negative sampling). Methods with more realistic considerations have been proposed. For example, Hamp and Rost (2015) generated negative samples by randomly sampling from all the pairs in each of the four PPI datasets: one training dataset and three testing datasets C1–C3 (C1, test pairs sharing both proteins with the training dataset; C2, test pairs sharing only one protein with the training dataset; and C3, test pairs sharing neither protein with the training dataset). The need to distinguish between these classes C1–C3 was introduced by Park and Marcotte (2012). Sun et al. (2017) generated negative samples by pairing proteins found in different subcellular locations, excluding proteins annotated with ambiguous or uncertain subcellular location terms and with two or more locations.

The size of the negative set and the balance between positive and negative samples are issues that should be carefully considered in developing an accurate and reliable method to predict potential PPIs. One common solution is to randomly sample negative samples at a fixed ratio of negative to positive samples. Hamp and Rost (2015) sampled 10 times as many negative samples as positive samples. However, the data imbalance is an issue that needs to be discussed and solved. More details of training datasets, test datasets, and independent test datasets used in recently developed methods and their data sources are shown in Table 2.

Table 2 Available PPI datasets and independent test datasets used in PPI prediction methods developed within the last 3 years (after 2019)

Sequence-based prediction of PPIs

The sequence-based prediction of PPIs refers to the problem of inferring, given a pair of protein sequences, the likelihood of an interaction between them, i.e., a score that represents their interacting probability. This approach can be applied to the inference of PPI networks by adding new nodes and new edges to the PPI network graph (Murakami et al. 2017; Tripathi et al. 2019). So far, many computational methods to solve this problem have been proposed as complementary to experimental methods. Even though these sequence-based methods are less accurate than structure-based methods, they are useful in predicting PPIs involving proteins for which structural information is unknown or which are intrinsically disordered. In addition, primary structures are available for all proteins, and thus, modeling and predicting PPIs using only sequence information has long been of interest. Almost all these methods are data-driven methods and can be categorized into statistical methods, similarity-based methods, and ML-based methods; however, most of the methods used in recent years are based on similarity or ML. Currently available web servers or downloadable programs for PPI prediction are shown in Table 3, along with a brief description of their strengths and weaknesses. In addition, the reported benchmark results of PPI prediction methods developed within the last three years are shown in Table 4; however, a fair performance comparison is difficult due to the different test datasets.

Table 3 Currently available web servers or downloadable programs for PPI prediction, and their strengths and weaknesses
Table 4 The reported benchmark results of PPI prediction methods developed within the last 3 years (after 2019)

Statistical methods

The statistical methods generally employ the statistical characteristics or the conserved patterns of protein sequences, assuming that functionally important proteins are conserved across organisms. Examples include the topological similarity between phylogenetic trees of a pair of proteins (Pazos and Valencia 2001), the co-occurrence of a finite number of short polypeptide sequences observed in known interacting protein pairs (Pitre et al. 2006), and the co-evolutionary divergence based on the assumption that protein pairs with similar substitution rates are likely to interact with each other (Hsin Liu et al. 2013). MirrorTree is a currently available server that detects the co-evolution between proteins and predicts their physical interactions (Ochoa and Pazos 2010). The underlying principle behind this method is that the co-evolution between interacting proteins is reflected in the similarity between the distance matrices of the corresponding phylogenetic trees of the interacting proteins (Craig and Liao 2007).
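As a rough illustration of the distance-matrix principle, the similarity of two phylogenetic distance matrices can be measured with a Pearson correlation over their upper triangles. The sketch below assumes that both matrices are square, symmetric, and ordered over the same set of organisms; the function names are hypothetical and this is not the implementation of the MirrorTree server.

```python
from math import sqrt

def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def tree_similarity(dist_a, dist_b):
    """Correlate inter-organism distances of protein family A with those of family B."""
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))
```

A score close to 1 indicates that the two families have diverged in a correlated manner across the shared organisms, which is taken as evidence of co-evolution.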

Similarity-based methods

The similarity-based methods employ homologous interactions, in which two PPIs are homologous if a pair of interacting proteins is homologous to a pair of other interacting proteins. The homologous interactions include, but are not limited to, orthologous interactions (homologous interactions found in different organisms), i.e., interologs (Walhout et al. 2000), and paralogous interactions (homologous interactions in the same organism). For example, BIPS (BIANA Interolog Prediction Server) (Garcia-Garcia et al. 2012) is based on interolog information, assuming that homologous proteins preserve similar functional behavior and also the same interactions (Matthews et al. 2001; Yu et al. 2004), and predicts interactions between proteins based on PPIs found in several PPI-related databases integrated using the BIANA (Biologic Interactions and Network Analysis) framework (Garcia-Garcia et al. 2010). SPRINT (Li and Ilie 2017) and PIPE4 (Dick et al. 2020) are based on the idea that a pair of query proteins (X1, X2) is likely to interact if it is similar to a known interacting protein pair (P1, P2); that is, X1 is similar to P1 and X2 is similar to P2. However, these methods do not always work well in the absence of known interacting protein pairs with high sequence similarity to the query protein pairs.
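The pairwise similarity rule can be sketched as follows. This is a simplified illustration and not the algorithm of BIPS, SPRINT, or PIPE4: the `similar` dictionary, which maps each query protein to the set of database proteins judged similar to it (for example, from a sequence search), is a hypothetical input, as is the function name.

```python
def supporting_interactions(x1, x2, known_ppis, similar):
    """Return the known pairs (P1, P2) that support an interaction between x1 and x2.

    A known pair supports the query if one query protein is similar to P1
    and the other is similar to P2, in either order.
    """
    sim1 = similar.get(x1, set())
    sim2 = similar.get(x2, set())
    support = []
    for p1, p2 in known_ppis:
        if (p1 in sim1 and p2 in sim2) or (p2 in sim1 and p1 in sim2):
            support.append((p1, p2))
    return support
```

An empty result corresponds to the limitation noted above: without a sufficiently similar known interacting pair, this kind of method has nothing to transfer.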

ML-based methods

The ML-based methods employ various supervised ML algorithms, such as SVM, RF, and DL. These algorithms are used for most of the existing PPI prediction methods. SVM aims to find a maximum margin hyperplane in an n-dimensional space (n is the number of features) that separates the labelled samples, i.e., maximizing the distance between the hyperplane and the nearest samples of each class. SVM requires considerable computing power to train and test on high-dimensional features with the radial basis function (RBF) kernel, which implicitly maps linearly inseparable samples into a space where they become linearly separable (kernel trick). RF is an ensemble learning method involving numerous decision trees (DT) for classification and outputs the class selected by most trees. RF can be trained effectively on a large dataset of PPIs and on vectors with many features and can rank the feature importance for accurate prediction. To train an RF model, the number of trees in the forest is usually tuned as a trade-off between computational time and accuracy. DL is an artificial neural network with multiple layers between the input and output layers. DL is considered to achieve better performance than the conventional ML-based methods in the PPI predictions. DL comprises several architectures, such as Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Graph Convolutional Networks (GCN). These architectures have been applied to the PPI predictions and require different input forms of proteins. DNN requires a one-dimensional vector, while the other architectures accept more flexible forms, for example, a two-dimensional matrix such as a position-specific scoring matrix (PSSM) (Hu et al. 2022). Recently developed DL-based methods are shown in Table 3.
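For reference, the maximum-margin idea and the radial basis function kernel mentioned above are usually written as below. This is the textbook hard-margin form, given here only as a sketch; the symbols are introduced for illustration and do not come from the cited methods: \(x_i\) is the n-dimensional feature vector of the i-th protein pair, \(y_i \in \{-1, +1\}\) its label, m the number of training pairs, and \(\gamma > 0\) a kernel width parameter.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad\text{subject to}\quad
y_i\,\bigl(w^{\top} x_i + b\bigr) \ge 1,\qquad i = 1,\dots,m

K(x, x') = \exp\!\bigl(-\gamma\,\lVert x - x' \rVert^{2}\bigr)
```

In practice, a soft-margin variant with slack variables is generally used so that a small number of training samples may violate the margin.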

Protein feature encoding

One important issue for the ML-based methods is how to encode protein sequences of variable lengths into fixed-length numeric feature vectors used for the development of the prediction models and the prediction of PPIs. A pair of feature vectors encoding a protein pair is generally inputted to the method either by combining them sequentially or separately. In addition, the extraction of appropriate features from protein sequences is critical for the accurate PPI prediction.
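A minimal sketch of this encoding step is shown below, using amino acid composition as an assumed, deliberately simple fixed-length descriptor. The function names are hypothetical, and real methods typically use richer features.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition(seq):
    """Map a sequence of any length to a 20-dimensional frequency vector."""
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts) or 1          # avoid division by zero for empty input
    return [c / total for c in counts]

def encode_pair(seq_a, seq_b):
    """Concatenate the two protein vectors into one 40-dimensional pair vector."""
    return composition(seq_a) + composition(seq_b)
```

Note that simple concatenation is order dependent, so `encode_pair(a, b)` and `encode_pair(b, a)` generally differ; one common workaround is to present both orders during training.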

Commonly used protein feature encoding methods include physicochemical properties of amino acids (AA) (Bock and Gough 2001; Sun et al. 2017), protein sequence profiles (evolutionary profiles) (Liu et al. 2019; Hashemifar et al. 2018; Hamp and Rost 2015), and protein sequence embedding (Alachram et al. 2021).

Various physicochemical properties of AA are used, such as the hydropathy index (hydrophobic or hydrophilic properties of an AA side chain), positively or negatively charged AA, uncharged AA, and the pKa value (the negative base-10 logarithm of the acid dissociation constant, which is a quantitative measure of the strength of an acid in solution). These properties are available in the AAindex (https://www.genome.jp/aaindex/) database (Kawashima and Kanehisa 2000; Kawashima et al. 2008).
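As an illustration, two such properties can be turned into sequence-level features as sketched below. The hydropathy values are those of the widely used Kyte-Doolittle scale, which is an assumption of this example (the text above does not name a specific scale); they are typed in by hand and should be checked against AAindex before use. The charge rule, which counts only K and R as positive and D and E as negative, is likewise a simplification.

```python
# Kyte-Doolittle hydropathy index (assumed scale; verify against AAindex).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_hydropathy(seq):
    """Average hydropathy over the residues of a sequence."""
    return sum(HYDROPATHY[aa] for aa in seq) / len(seq)

def net_charge(seq):
    """Crude net charge: +1 for K/R, -1 for D/E, ignoring histidine and termini."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative
```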

Protein sequence profiles are a list of preferences for each AA at each position in a given multiple sequence alignment (MSA), i.e., a PSSM, which is derived from the MSA built by position-specific iterative BLAST (PSI-BLAST) (Altschul et al. 1997). A PSSM is an informative protein feature based on evolutionary information extracted from the MSA, even though PSI-BLAST requires an enormous search time to compute it. In addition, a PSSM can capture evolutionarily conserved interactions between protein sequences through their evolutionary information. For example, DPPI (Hashemifar et al. 2018) and TransPPI (Yang et al. 2021) employ a PSSM, which is an \(N\times 20\) matrix \(M = \{M_{ij} \mid i = 1,\dots,N;\ j = 1,\dots,20\}\), where N is the length of a given protein sequence and each element \(M_{ij}\) is the score of the jth AA in the ith position of the sequence.
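To make the shape and meaning of such a matrix concrete, the sketch below builds a simplified log-odds profile from a small ungapped alignment. It is not the PSI-BLAST procedure, which additionally uses sequence weighting and data-dependent pseudocounts; the uniform background frequency, the flat pseudocount, and the function name are all assumptions of this example.

```python
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def simple_pssm(alignment, pseudocount=1.0):
    """Return an N x 20 list of log-odds scores for an ungapped alignment.

    Rows are alignment positions, columns follow AMINO_ACIDS.
    """
    n_seqs = len(alignment)
    length = len(alignment[0])
    background = 1.0 / len(AMINO_ACIDS)          # uniform background (assumed)
    denom = n_seqs + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for i in range(length):
        column = [seq[i] for seq in alignment]
        row = []
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / denom
            row.append(log2(freq / background))  # > 0 means enriched at this position
        pssm.append(row)
    return pssm
```

Positive entries mark residues that are over-represented at a position relative to the background, and negative entries mark residues that are rarely or never observed there.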

Protein sequence embedding captures semantic information on AA residues in entire sequences. The widely used embedding methods, such as Word2Vec (Mikolov et al. 2013a) and Doc2Vec (Le and Mikolov 2014), were originally developed in the field of natural language processing in order to obtain distributed representations of words and documents. These representations are learned from the contexts of words in each document using a shallow two-layer neural network with the continuous bag-of-words model (CBOW) or the continuous skip-gram model (Skip-Gram). CBOW predicts a target word from its surrounding context words, while Skip-Gram predicts the surrounding context words from a target word. CBOW trains faster, while Skip-Gram works well with small amounts of training data and yields better representations for infrequent words (Mikolov et al. 2013b). For biological sequences, a sequence is regarded as a sentence and represented by multiple words of k consecutive AA (k-mers), which are used to train the Word2Vec or Doc2Vec models. These methods were recently applied to the prediction of human and virus protein interactions, showing that they learned protein features well enough to enable the reliable prediction of human-virus PPIs (Yang et al. 2020b; Tsukiyama et al. 2021). Furthermore, several new residue representation methods based on Word2Vec have been proposed, such as Res2vec (Yao et al. 2019) and DL2vec (Chen et al. 2021).
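The preprocessing behind such embeddings can be sketched as follows: a protein sequence is split into overlapping k-mer "words", and Skip-Gram training examples are formed as (target word, context word) pairs within a window. The function names, the overlapping tokenization, and the window size are assumptions made for illustration; the cited methods may tokenize differently, and the neural network training itself is omitted.

```python
def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers, treated as words of a sentence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def skipgram_pairs(tokens, window=1):
    """List (target, context) pairs for every token and its neighbours within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs
```

For CBOW, the same windows would be used in the opposite direction, with the neighbouring k-mers jointly predicting the central one.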

Current issues

False positives, overfitting, and underfitting

Prediction models generated are usually evaluated with several performance measures to compare predicted scores with the actual observed ones. To minimize false positives and false negatives, it is necessary to use performance measures that can appropriately evaluate predictive performance, even though it is difficult to distinguish between false positives and novel PPIs.
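As a hedged illustration, the sketch below computes several measures that are commonly reported for binary classifiers (precision, recall, specificity, F1, and the Matthews correlation coefficient) from lists of true and predicted labels. The choice of measures and the function names are assumptions of this example, since the text above does not prescribe a specific set.

```python
from math import sqrt

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def safe_div(a, b):
    """Return a / b, or 0.0 when the denominator is zero."""
    return a / b if b else 0.0

def evaluate(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    mcc_denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
        "mcc": safe_div(tp * tn - fp * fn, mcc_denom),
    }
```

Because apparent false positives may include interactions that have simply not been observed yet, precision computed against current databases is better read as a lower bound than as an exact value.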

Overfitting and underfitting issues can occur when generating predictive models. The performance of a predictive model generated by a learning algorithm depends on how well it captures the underlying features of the training dataset. When the algorithm is too simple and insufficient to model the training dataset, it is called underfitting, whereas when the algorithm is too complex and the training dataset is not sufficient to constrain it, it is called overfitting (Sarkar and Saha 2019). These issues can be mitigated by selecting an appropriate learning algorithm, performing appropriate evaluation with independent blind test datasets, and preparing a training dataset that is unbiased and does not miss underlying features of PPIs. In terms of training dataset preparation, to address overfitting, it is important to appropriately reduce the redundancy of homologous sequences present in the dataset. CD-HIT (Fu et al. 2012) is generally used for clustering and comparing protein sequences with several options, such as a sequence identity threshold or an alignment coverage, to reduce redundancies in the dataset. The main advantage of this program is that it is very fast and can handle extremely large datasets. Attention should be paid to the appropriate setting of the options used in CD-HIT. Datasets containing proteins with higher sequence similarity (e.g., > 30%) may lead to overfitting and overestimate the prediction performance. To avoid overfitting and develop reliable prediction models, it is advisable to use a low sequence identity cut-off (< 30%), as widely used in various sequence-based methods shown in Table 2. On the other hand, to address underfitting, sufficient and up-to-date PPI data that reflect the features of target PPIs should be collected from the existing databases shown in Table 1.

ML-based methods, realistic datasets

In the ML-based prediction methods, it is important to select the optimal ML approach. Currently, various ML-based methods have been developed that differ in terms of protein representation, method complexity, various protein features, and computational cost. DL-based methods have reported higher prediction performance than other methods but are computationally expensive.

Most methods have been developed on a balanced dataset containing equal numbers of interacting and non-interacting protein pairs, or on an imbalanced dataset in which non-interacting pairs outnumber interacting pairs several times. However, in nature the ratio of interacting to non-interacting protein pairs is highly imbalanced, and few methods have been developed or evaluated with regard to the impact of realistic datasets containing huge numbers of non-interacting pairs. Because such realistic datasets are necessarily huge, handling them requires more efficient algorithms for the prediction of PPIs.
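The effect of this imbalance on precision can be worked out directly from the true positive rate (TPR), the false positive rate (FPR), and the fraction of interacting pairs in the data. The sketch below is a generic calculation, not a result from any cited study; the function name and the example rates are hypothetical.

```python
def expected_precision(tpr, fpr, positive_fraction):
    """Precision implied by fixed TPR and FPR at a given fraction of true positives."""
    true_positives = tpr * positive_fraction
    false_positives = fpr * (1.0 - positive_fraction)
    return true_positives / (true_positives + false_positives)

# A classifier with TPR = 0.9 and FPR = 0.1 evaluated at two class ratios.
balanced = expected_precision(0.9, 0.1, 0.5)        # 1:1 positives to negatives
realistic = expected_precision(0.9, 0.1, 1 / 101)   # 1:100 positives to negatives
```

With these illustrative rates, precision is 0.9 on the balanced set but falls to about 0.08 at a 1:100 ratio, which is why performance reported on balanced benchmarks can be misleading for proteome-scale screening.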

Future perspectives and conclusion

Despite advances in techniques for large-scale experimental analysis of protein interactions, our knowledge of the whole set of PPIs in a particular cell is still incomplete, considering various physicochemical factors such as transient dynamics, post-translational modification (PTM), intrinsically disordered regions, and physiological conditions. For example, a comprehensive understanding of liquid–liquid phase separation (LLPS), which has received increasing attention over the past decade, requires the construction of complete PPI networks involving phase-separated proteins and the analysis of the network properties of different classes of phase-separating proteins in the human interactome. LLPS is an important mechanism that drives the formation of membrane-less organelles and is fundamentally driven by multivalent interactions between proteins and/or nucleic acids (Li et al. 2012; Mondal et al. 2022), which can occur between multiple folded domains of proteins or be mediated by intrinsically disordered proteins (Chu et al. 2022). For example, a mutation in the Speckle-type POZ protein (SPOP), a tumor-suppressor protein, can lead to the formation of many solid tumors, including prostate, gastric, and colorectal cancers (Wang et al. 2021). A recent study revealed that substrates of SPOP can phase-separate with SPOP to form condensates in vitro and co-localize in liquid nuclear organelles in cells (Bouchard et al. 2018). Therefore, identification of PPI networks requires further improvement to uncover all possible interactions that may exist at the same cellular localization or at the proteome level. In addition, it is indispensable to develop computational methods to predict potential PPIs rapidly and accurately from a large number of candidate protein pairs using only sequence information. The development of such methods will enable comprehensive PPI prediction between proteins, leading to the identification of interacting partners of proteins of interest.

In general, many ML-based PPI prediction methods discriminate whether two proteins interact or not, given output scores from those ML models, but these scores do not explain the strength of the interactions of the two proteins. In order to further improve the reliability of PPI prediction and capture PPI network properties more clearly, protein binding affinities, which are typically measured by the equilibrium dissociation constant (Kd), should be considered to assess predicted interactions. However, the determination of the protein binding affinity is generally not applicable on a large scale due to the dissociation rate dependent accuracy of experimental methods, cost and time constraints, and the need for protein complexes (Abbasi et al. 2020). Therefore, accurate computational techniques can play an important role in the protein binding affinity determination. In particular, sequence-based protein binding affinity prediction is challenging but, like sequence-based PPI prediction, is also an important research topic (Yugandhar and Gromiha 2014; Abbasi et al. 2020). In addition, further development of sequence-based interaction site prediction is also important, as detailed interaction site and residue information enables more accurate PPI prediction, more accurate prediction of protein binding affinity, and more accurate analysis of PPI network properties.
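For orientation, the equilibrium dissociation constant can be converted into a standard binding free energy with the textbook relation ΔG° = RT ln(Kd / c°), where c° is the 1 M standard concentration. The sketch below applies this relation; the function name and the default temperature of 298.15 K are assumptions of the example.

```python
from math import log

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def binding_free_energy(kd_molar, temperature_k=298.15):
    """Standard binding free energy (kcal/mol) from Kd given in mol/L.

    Uses dG = R * T * ln(Kd / 1 M); tighter binding (smaller Kd) gives a more negative value.
    """
    return R_KCAL * temperature_k * log(kd_molar)
```

For instance, a nanomolar interaction (Kd = 1e-9 M) corresponds to roughly -12.3 kcal/mol, whereas a micromolar one (Kd = 1e-6 M) corresponds to roughly -8.2 kcal/mol, a difference that a binary interaction score cannot express.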

Various PPI prediction methods have been proposed and are available as web applications or on Internet hosting services for software development such as GitHub (Table 2). The availability of these methods as a public resource is of great benefit to the drug discovery community as well as to further advances in PPI prediction. ML-based methods, especially DL-based methods, are currently very successful in PPI prediction. Further improvements to these methods include development with more realistic datasets and construction of large, independent, and unbiased datasets to evaluate the proposed methods, because the performance of these methods essentially depends on reliable training data with known PPIs determined experimentally. In addition, it will be necessary to develop protein encoding methods that can better capture various protein features including functional, structural, and evolutionary information.

As evidenced by the recent increase in the development of sequence-based PPI prediction methods based on heterogeneous datasets (Tables 2 and 3), these methods are readily available and indispensable for solving pressing questions, such as the mechanism of infection, without relying on structural information. In addition, although sequence-based PPI prediction alone cannot surpass the reliability of structure-based methods, leveraging recently developed protein structure prediction methods such as AlphaFold 2.0 may help overcome this weakness and is expected to further improve the reliability of PPI predictions.