Background

Pseudomonas syringae is a Gram-negative bacterium that causes economically important diseases in a wide range of plant species, leading to severe agricultural losses worldwide. Each strain of Pseudomonas shows a high degree of host specificity and infects only a limited number of plant species, or even only a few cultivars of a single plant species [1, 2]. Among them, pathovar tomato strain DC3000 (Pst DC3000) is known to infect the plant hosts Arabidopsis thaliana and tomato, causing bacterial speck and brown spot. Thus, Arabidopsis-Pseudomonas has been accepted as a model system for experimental characterization of the molecular dynamics of plant-pathogen interactions in both resistant and susceptible interactions [1, 3, 4]. The whole genome sequence of Pst DC3000 revealed that it has ~300 virulence-related genes [5]. One of the major classes of virulence factors consists of effector proteins that are delivered into the host through a type III protein secretion system (TTSS) to suppress plant immune responses and to facilitate disease development [6]. Pseudomonas syringae pathogenesis depends on these effector proteins, and to date nearly 60 different type III effector proteins encoded by hop genes have been identified [http://www.pseudomonas-syringae.org/]. In addition, Pst DC3000 produces non-proteinaceous virulence effectors, including coronatine (COR), which are crucial for pathogenesis. However, the virulence function and mode of action of a large number of potential effectors encoded by the Pst DC3000 genome are still unknown. Similarly, in Arabidopsis approximately 3000 proteins have been found to be directly related to plant defense [7]. Many of these proteins interact directly with pathogen proteins, and some of them initiate plant defense responses to the infection. Recently, Mukhtar et al. [8] reported an experimental protein interaction network (PPIN-1) containing 843 Arabidopsis proteins and 83 pathogen effectors, including very few interactions with Pst DC3000. To date, interaction evidence exists for only about 10% of the full genome of Arabidopsis. Therefore, to functionally characterize the dynamic interactions of plants with bacterial pathogens, a genome-wide study of host-pathogen interactions is needed. Knowledge of such novel resistance interactions provides the backbone for understanding plant resistance mechanisms and will aid further analysis of plant immunity [9].

Generally, a pathogen attacks host tissues by secreting degradative enzymes and releasing toxins. Many of these mechanisms involve protein-protein interactions (PPIs). PPIs are essential processes in all living cells and play a crucial role both in the infection process and in initiating a defense response. In this context, understanding the PPI network (interactome) between plant proteins and pathogen proteins is a critical step in studying the molecular basis of pathogenesis [10, 11]. In particular, computational approaches make it feasible to study host-pathogen protein interactions on a genome-wide scale.

In the past decade, a series of PPI prediction methods have been developed and are playing an increasingly important role in complementing experimental approaches. Diverse data types or properties, such as gene ontology (GO) annotations [12], protein sequence similarity [13], protein domain interactions [14], and protein structural information [15, 16], have frequently been used to construct PPI prediction methods. Among these computational methods, the interolog and the domain-based methods [17-23] are widely used approaches for PPI prediction.

In this work, we used the interolog and the domain-based methods to jointly predict the protein-protein interactions between Pseudomonas syringae and Arabidopsis thaliana. The domain-based approach infers inter-species protein-protein interactions from known domain-domain interactions collected from various databases, whereas the interolog approach identifies protein-protein interactions based on homologous pairs of interacting proteins across different organisms. We present the prediction pipeline in detail, together with a functional analysis of the predicted results.

Materials and methods

Data sources

The whole proteome of Pseudomonas syringae pv. tomato DC3000, which contains 5619 protein sequences, is downloaded from the Pseudomonas Genome Database (http://www.pseudomonas.com/download.jsp). Similarly, the full proteome of Arabidopsis thaliana, containing 35386 protein sequences, is extracted from the TAIR10 database (http://www.arabidopsis.org/). For the interolog-based prediction, we use two datasets: DIP and HPIDB. The Database of Interacting Proteins (DIP) is a collection of experimentally determined intra-species protein interactions [24]. As of January 2014, DIP contains 72380 protein-protein interactions involving 25749 sequences. The Host Pathogen Interaction Database (HPIDB) is a database of experimentally determined interactions between 62 hosts and 529 pathogens [25]. As of January 2014, HPIDB contains 23735 unique protein-protein interactions involving 29922 sequences. For the domain-based model, the domain-domain interaction databases iPfam and 3DID are used. The iPfam database is a catalog of protein family interactions, including domain and ligand interactions, calculated from known structures in the Protein Data Bank (PDB). As of January 2014, iPfam 1.0 contains 5442 domain-domain interactions. The database of three-dimensional interacting domains (3DID) is a collection of high-resolution three-dimensional structural templates for domain-domain interactions. It contains templates for interactions between two globular domains as well as novel domain-peptide interactions. As of January 2014, 3DID contains 8323 domain-domain interactions.

Identification of secreted proteins in Pseudomonas syringae

All Pseudomonas proteins are processed through PSORTb 3.0, a widely used tool for predicting protein localization in bacteria [26]. Proteins predicted as cytoplasmic or cytoplasmic membrane are discarded, since they are less likely to be involved in interactions with host proteins. The remaining proteins, annotated as extracellular, outer membrane or unknown, are considered positive candidates for interaction. In addition, we search the whole proteome of Pseudomonas against the effector database (http://www.effectors.org/) [27], an integrated database of secreted bacterial proteins. Proteins identified as secreted are also considered positive candidates for interaction. Combining these two steps, 2744 potential candidate proteins of Pst DC3000 are retained for interaction prediction.
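The two-step filter described above can be sketched as a set union. This is only an illustrative sketch: the localization label spellings, the protein IDs and the function name `select_candidates` are hypothetical, and it assumes that combining the two steps means keeping any protein that passes either filter.

```python
# Localization labels treated as non-candidates (hypothetical spellings).
EXCLUDED = {"Cytoplasmic", "CytoplasmicMembrane"}


def select_candidates(localization, secreted_ids):
    """Keep proteins that pass the localization filter or are flagged as secreted.

    localization: dict mapping protein ID -> predicted localization label
    secreted_ids: iterable of protein IDs reported as secreted
    """
    by_localization = {
        pid for pid, loc in localization.items() if loc not in EXCLUDED
    }
    # Assumption: the two filters are combined as a union.
    return by_localization | set(secreted_ids)


# Toy input with made-up protein IDs.
toy_localization = {
    "P1": "Extracellular",
    "P2": "Cytoplasmic",
    "P3": "Unknown",
    "P4": "CytoplasmicMembrane",
}
toy_secreted = {"P2"}  # fails the localization filter but is flagged as secreted

candidates = select_candidates(toy_localization, toy_secreted)
print(sorted(candidates))
```

In this toy run, P2 is retained only because of the secretion evidence, while P4 is dropped by both filters.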

Prediction of PPIs between Arabidopsis and Pseudomonas

In this study, the probability of interaction between an Arabidopsis protein and a Pseudomonas protein is inferred by two approaches applied individually: the domain-based method and the interolog method. The prediction framework is shown in Figure 1.

Figure 1. Overall prediction framework of the interactions between Arabidopsis thaliana and Pseudomonas syringae.

Domain based protein-protein interaction prediction

The domain-based method uses domain interaction information, which is derived from known protein 3D structures, to infer potential PPIs. If two proteins contain an interacting domain pair, these two proteins are expected to interact with each other. To identify the domains in Arabidopsis and Pseudomonas proteins, HMMPfam is run within InterProScan 5 [28]. In total, 49073 domains are extracted for the Arabidopsis proteins and 7253 domains are collected for Pst DC3000. If a protein pair between Pseudomonas and Arabidopsis contains an interacting domain pair from iPfam and 3DID, then the pair is predicted to interact.
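The rule above can be illustrated with a minimal sketch. The protein IDs, domain names and the function name `predict_domain_ppis` are hypothetical, and the sketch assumes that the domain-domain interactions from the two databases are pooled into a single set of unordered pairs.

```python
from itertools import product


def predict_domain_ppis(host_domains, pathogen_domains, ddi_pairs):
    """Predict host-pathogen pairs that share at least one interacting domain pair.

    host_domains, pathogen_domains: dict mapping protein ID -> set of domain IDs
    ddi_pairs: iterable of (domain_a, domain_b) known domain-domain interactions
    """
    # Store interactions as unordered pairs so that (a, b) matches (b, a).
    ddis = {frozenset(pair) for pair in ddi_pairs}
    predicted = set()
    for (host, h_doms), (pathogen, p_doms) in product(
        host_domains.items(), pathogen_domains.items()
    ):
        if any(frozenset((dh, dp)) in ddis for dh in h_doms for dp in p_doms):
            predicted.add((host, pathogen))
    return predicted


# Toy data with made-up identifiers.
host = {"AT_x": {"DomA", "DomB"}, "AT_y": {"DomC"}}
pathogen = {"PS_1": {"DomD"}, "PS_2": {"DomE"}}
known_ddis = [("DomD", "DomA")]  # reversed order still matches

domain_ppis = predict_domain_ppis(host, pathogen, known_ddis)
print(domain_ppis)
```

Looping over every protein pair is quadratic; for ~97M pairs an index from domain to proteins would be a more practical design, but the exhaustive loop keeps the rule easy to read.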

Interolog based protein-protein interaction prediction

The interolog method relies on protein sequence similarity to conduct the PPI prediction. An interolog is a conserved interaction between a pair of proteins which have interacting homologs in another organism [29]. The concept is illustrated in Figure 2. Consider two interacting proteins A and B in one organism, and two interacting proteins A' and B' in another organism. The interaction between A and B is an interolog of the interaction between A' and B' if A is a homolog of A' and B is a homolog of B'. Thus, interologs are homologous pairs of protein interactions across different organisms. Each protein in Arabidopsis and Pseudomonas is searched with BLAST against all the protein sequences in the DIP and HPIDB databases to identify homologs, using cutoffs of 1.0E-4 for E-value, 50% for sequence identity and 80% for aligned sequence length coverage. A protein pair between Pseudomonas and Arabidopsis is predicted to interact if an experimentally verified interaction exists between their respective homologous proteins in the DIP or HPIDB databases.
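The interolog transfer can be sketched in two steps: filtering BLAST hits with the stated cutoffs, then transferring known interactions to the query proteins. This is a sketch under assumptions: the tuple layout of the hits, the function names and all IDs are hypothetical, the cutoffs are treated as inclusive, and how coverage is computed is not specified in the text.

```python
from collections import defaultdict

E_VALUE_MAX = 1.0e-4   # cutoffs taken from the text
IDENTITY_MIN = 50.0    # percent
COVERAGE_MIN = 80.0    # percent


def filter_homologs(hits):
    """Map each query protein to the database proteins passing all three cutoffs.

    hits: iterable of (query_id, subject_id, evalue, pct_identity, pct_coverage)
    """
    homologs = defaultdict(set)
    for query, subject, evalue, identity, coverage in hits:
        if (evalue <= E_VALUE_MAX and identity >= IDENTITY_MIN
                and coverage >= COVERAGE_MIN):
            homologs[query].add(subject)
    return homologs


def predict_interologs(host_homologs, pathogen_homologs, known_ppis):
    """Predict (host, pathogen) pairs whose homologs are known to interact."""
    known = {frozenset(pair) for pair in known_ppis}
    predicted = set()
    for host, h_homs in host_homologs.items():
        for pathogen, p_homs in pathogen_homologs.items():
            if any(frozenset((a, b)) in known for a in h_homs for b in p_homs):
                predicted.add((host, pathogen))
    return predicted


# Toy data with made-up identifiers and scores.
host_hits = [("AT_x", "DB_1", 1e-30, 72.0, 95.0),
             ("AT_y", "DB_2", 1e-3, 90.0, 99.0)]   # fails the E-value cutoff
pathogen_hits = [("PS_1", "DB_3", 1e-10, 55.0, 85.0)]
known = [("DB_1", "DB_3")]

interolog_ppis = predict_interologs(
    filter_homologs(host_hits), filter_homologs(pathogen_hits), known
)
print(interolog_ppis)
```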

Figure 2. Illustration of protein-protein interologs. A and B are two interacting proteins in one organism, and A' and B' are two interacting proteins in another organism. Proteins A and A', and likewise B and B', are orthologs between the two organisms. Thus, the protein pairs A-B and A'-B' are interologs, that is, interactions conserved between the organisms.

Results and discussion

Prediction of interactions

To predict genome-wide interactions, all Arabidopsis proteins are paired with the candidate Pseudomonas proteins, which yields ~97M candidate protein pairs. Each pair is assessed by the domain-based model and the interolog-based model separately. The predicted interactions from these methods are reported in Table 1. In total, ~0.86M probable PPIs are predicted by the two methods combined, involving 14043 Arabidopsis proteins and 1337 Pseudomonas proteins. Of these, 85650 PPIs are predicted by the domain-based method, involving 11432 Arabidopsis and 887 Pseudomonas proteins. Similarly, the interolog method predicts ~0.79M PPIs, involving 7766 Arabidopsis and 1068 Pseudomonas proteins. Nearly 11000 PPIs, comprising 2043 Arabidopsis and 93 Pseudomonas proteins, are predicted by both methods and form the consensus set. The interaction network of the consensus PPIs is shown in Figure 3. On average, a Pseudomonas protein has around 118 Arabidopsis interacting partners, whereas an Arabidopsis protein interacts with around 6 Pseudomonas proteins. These results are consistent with previous studies, which demonstrated that only a few pathogen proteins are involved in interactions with the host interactome [11, 18, 19]. All predicted interactions from the domain-based method, the interolog method and the consensus set are available in Tables S1-S3 in Additional files 1, 2 and 3, respectively.
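The ~97M figure matches the product of the 35386 Arabidopsis sequences and the 2744 filtered Pst DC3000 candidates, and the combined and consensus sets correspond to a set union and intersection. The short sketch below illustrates both; the function name `combine_predictions` and the toy pairs are hypothetical.

```python
N_ARABIDOPSIS = 35386     # Arabidopsis protein sequences (from the text)
N_PST_CANDIDATES = 2744   # filtered Pst DC3000 candidates (from the text)

# Number of host-pathogen pairs to evaluate.
candidate_pairs = N_ARABIDOPSIS * N_PST_CANDIDATES
print(f"{candidate_pairs:,}")  # 97,099,184


def combine_predictions(domain_ppis, interolog_ppis):
    """Return (all predicted pairs, pairs predicted by both methods)."""
    domain_ppis, interolog_ppis = set(domain_ppis), set(interolog_ppis)
    return domain_ppis | interolog_ppis, domain_ppis & interolog_ppis


# Toy predictions with made-up identifiers.
domain_toy = {("AT_x", "PS_1"), ("AT_y", "PS_1")}
interolog_toy = {("AT_x", "PS_1"), ("AT_z", "PS_2")}

combined, consensus = combine_predictions(domain_toy, interolog_toy)
print(len(combined), len(consensus))
```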

Table 1. Prediction results of Arabidopsis and Pseudomonas syringae interactions using the domain and interolog approaches.

Figure 3. Visualization of the predicted protein-protein interactions between Arabidopsis thaliana and Pseudomonas syringae. Each node represents a protein and each edge represents an interaction. Green circles represent Arabidopsis proteins and red diamonds represent Pseudomonas proteins. The network is generated using Cytoscape.

Predicted effector hubs

The Pseudomonas effectors with the highest number of edges (hubs), each with more than 400 PPIs in the Arabidopsis-Pseudomonas interactome, are PSPTO_0135, PSPTO_0400, PSPTO_0540, PSPTO_0808, PSPTO_1510, PSPTO_2303, PSPTO_2529, PSPTO_2632, PSPTO_3161, PSPTO_3583, PSPTO_3890, PSPTO_3912 and PSPTO_4001. There are also several effectors with more than 40 predicted PPIs: PSPTO_4497, PSPTO_1482, PSPTO_4868, PSPTO_4602, PSPTO_3882, PSPTO_0405, PSPTO_1492, PSPTO_4093, PSPTO_1949, PSPTO_4776, PSPTO_3130, PSPTO_3900, PSPTO_5014 and PSPTO_4090. These hub proteins may play an important role in pathogenesis and can therefore be investigated further to decipher virulence mechanisms. In contrast to these hub proteins, several effectors are predicted to interact with very few proteins.
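Hubs of this kind can be found by counting the number of distinct host partners per pathogen protein. The sketch below shows one way to do this; the function names and the toy edge list are hypothetical, and the threshold is a parameter (the text uses more than 400 and more than 40).

```python
from collections import Counter


def effector_degrees(ppis):
    """Count distinct host partners for each pathogen protein.

    ppis: iterable of (host_id, pathogen_id) predicted interactions
    """
    # set() removes duplicate edges before counting.
    return Counter(pathogen for _host, pathogen in set(ppis))


def find_hubs(ppis, min_partners):
    """Return pathogen proteins with strictly more than min_partners partners."""
    degrees = effector_degrees(ppis)
    return sorted(p for p, degree in degrees.items() if degree > min_partners)


# Toy network with made-up identifiers.
toy_ppis = [("H1", "E1"), ("H2", "E1"), ("H3", "E1"), ("H1", "E2"), ("H1", "E2")]

print(effector_degrees(toy_ppis))
print(find_hubs(toy_ppis, min_partners=2))
```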

Functional enrichment analysis of proteins involved in the interaction

Functional enrichment analysis is an important assessment for elucidating the functional relevance of the host and pathogen proteins involved in the PPIs. The presence of enriched (over-represented) functional categories that are closely related to host defense and pathogen infection supports the validity of the PPIs predicted by the models. Gene ontology (GO) is a comprehensive functional system for annotating gene products. We used biological process GO term enrichment to assess the relevance of the predicted proteins. The Database for Annotation, Visualization and Integrated Discovery (DAVID) is used to conduct the enrichment analysis [30]. The over-represented biological processes of Arabidopsis and Pseudomonas proteins in the predicted PPIs are listed in Tables 2 and 3, respectively. The enrichment analysis in Arabidopsis shows that many proteins are involved in the biological processes response to cadmium ion and response to metal ion. It has been shown in the literature that metal ions are required for pathogen virulence and plant defense [31, 32]. Fones et al. demonstrated that Zn, Ni or Cd is accumulated when Thlaspi caerulescens resists a leaf spot caused by Pseudomonas syringae pv. maculicola [31]. Block and James reveal that plant immune responses include deposition of lignin and callose in the cell wall and production of reactive oxygen species and anti-microbial compounds [33]. Qiu et al. [34] show that MAPK/ERK kinase may act directly or indirectly through another signaling cascade to activate a transcription factor. The transcription factor then binds a particular region of DNA, resulting in the recruitment of RNA polymerase to transcribe a gene that ultimately contributes to altering the function of the cell and causing pathogenesis [35]. This evidence from the literature supports our predicted results.
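The idea of over-representation can be illustrated with a plain hypergeometric tail probability. This is a stand-in for illustration only: DAVID reports its own modified Fisher exact statistic, and the function name, argument names and toy numbers below are assumptions, not values from this study.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """Probability of seeing at least k annotated proteins by chance.

    N: proteins in the background set
    K: background proteins annotated with the GO term
    n: proteins in the predicted set
    k: predicted proteins annotated with the GO term
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the hypergeometric probabilities for k, k + 1, ..., upper.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy numbers: 10 background proteins, 3 carry the term,
# 4 proteins are predicted, and 2 of those carry the term.
p = enrichment_p_value(k=2, n=4, K=3, N=10)
print(round(p, 4))
```

A small p-value indicates that the term appears among the predicted proteins more often than expected from random sampling of the background; in practice the p-values for many terms would also need multiple-testing correction.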

Table 2. Enriched GO biological process terms involved in predicted Arabidopsis proteins.
Table 3. Enriched GO biological process terms involved in predicted Pseudomonas syringae proteins.

Subcellular localization of Arabidopsis proteins targeted by the predicted Pseudomonas proteins

Pathogens suppress host immunity by delivering a range of secreted proteins, or effectors, into the cytoplasm of host cells. Once these effector proteins have traversed the host plasma membrane, they are transported to many subcellular locations, where they subvert the host immune system to enable pathogen growth and reproduction. Knowledge of the cellular compartments of the Arabidopsis proteins targeted by the predicted Pseudomonas proteins will help in deciphering the mechanism of host-pathogen interactions. If the targeted Arabidopsis proteins are located in cellular compartments that are relevant to infection, or that are likely to be involved in interactions with the pathogen, this supports the predicted host-pathogen interactions.

To gain a clear understanding of where the interactions occur in the host, we extracted the subcellular localization of the Arabidopsis proteins predicted by both the domain-based and interolog methods using AtSubP [36], available in the TAIR database. To date, AtSubP is the only tool for genome-scale subcellular location prediction of Arabidopsis proteins with high accuracy for seven locations. The subcellular locations of all predicted Arabidopsis proteins are listed in Table 4. We found that 29% of the host proteins are localized in the nucleus, 9% are extracellular, 10% are in the chloroplast, 16% in the cytoplasm, 10% in the cell membrane, 1% in the Golgi and 5% in the mitochondrion, while 20% have unknown localization. This indicates that the majority of the interactions occur in the nucleus, cytoplasm, chloroplast and plasma membrane. A recent review by Block and James [33] shows that the effectors of Pseudomonas syringae mostly target plant proteins in the plasma membrane, chloroplast and mitochondrion. Citovsky et al. [37] showed that when Agrobacterium tumefaciens interacts with A. thaliana, it hijacks the VIP1 protein and uses it to shuttle transfer-DNA (T-DNA) into the nucleus for its reproduction. Tao et al. found that TIP, an Arabidopsis protein, interacts with the coat protein (CP) of Turnip crinkle virus (TCV) in the nuclei of yeast cells [38]. Thus, the locations of the interacting Arabidopsis proteins predicted by our approach are in close agreement with earlier findings. In addition, the localization of a large number of proteins is still unknown, and these proteins need special attention in experimental characterization.

Table 4. Distribution of subcellular localization for predicted interacting proteins in Arabidopsis thaliana from both the domain and interolog-based approaches.

Conclusion

In this study, we have demonstrated that sequence and domain similarity to known interactions provides valuable information for predicting host-pathogen interactions. We identified ~11000 consensus PPIs between Arabidopsis thaliana and Pseudomonas syringae pv. tomato DC3000 that are supported by both the domain-based and the interolog approaches. The functional annotations of the Arabidopsis and Pseudomonas proteins involved in the predicted PPIs were analyzed, and they show the relevance of these proteins to host defense and pathogen infection. The present work may provide useful information and resources for the plant community to understand the molecular mechanisms of the plant immune system against pathogen virulence. The quality of the predicted interactome could be further improved by combining these methods with other computational approaches and biological data sources. The reliability of the predicted interactions can be further assessed through experimental validation.