ImitateDB starts with all experimentally determined HP-PPIs, to which the first interactors of the host proteins were added. These experimentally determined host and pathogen proteins are more likely to be co-expressed and co-localized, which increases the confidence in the identification of mimicry pairs. The DMPs and MMPs provide information about the matched domains and motifs between the pathogen proteins and the first interactor proteins of the respective interacting host proteins. The information in the database is provided for different categories of pathogens, namely viruses, bacteria, and fungi. The pathogens belonging to Protozoa and Amoebozoa are found under the “Others” category. Figure 3 shows the database schema depicting the pipeline followed for all the search options, and the workflow for the development of the database along with the frequency of primary entities.
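The pairing logic described above can be illustrated with a minimal sketch. It is not the ImitateDB implementation: the function name `find_mimicry_pairs`, the input structures and the feature label "S1" are assumptions made for illustration, and only the UniProt IDs are taken from the K3L example in this paper.

```python
from collections import defaultdict


def find_mimicry_pairs(hp_ppis, host_ppis, features):
    """Return (pathogen, host, first_interactor, shared_feature) tuples.

    hp_ppis   : iterable of (pathogen_protein, host_protein) pairs
    host_ppis : iterable of (host_protein, first_interactor) pairs
    features  : dict mapping a protein ID to its set of domain or motif IDs
    """
    # Index the first interactors of every host protein.
    interactors = defaultdict(set)
    for host, partner in host_ppis:
        interactors[host].add(partner)

    pairs = []
    for pathogen, host in hp_ppis:
        pathogen_features = features.get(pathogen, set())
        for partner in sorted(interactors.get(host, set())):
            # A mimicry pair exists when the pathogen protein and the
            # host first interactor carry the same domain or motif.
            shared = pathogen_features & features.get(partner, set())
            for feature in sorted(shared):
                pairs.append((pathogen, host, partner, feature))
    return pairs


# Toy input: K3L (P20639) interacts with PKR (P19525), whose first
# interactor eIF2-alpha (P05198) shares a hypothetical "S1" label with K3L.
toy_pairs = find_mimicry_pairs(
    hp_ppis=[("P20639", "P19525")],
    host_ppis=[("P19525", "P05198")],
    features={"P20639": {"S1"}, "P05198": {"S1", "OTHER"}},
)
print(toy_pairs)
```

Passing domain annotations as `features` yields DMP-like tuples, and passing motif annotations yields MMP-like tuples, so the same routine covers both pair types in this sketch.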
Experimental HP-PPI data
Viruses have the highest number of reported HP-PPIs among the different pathogen categories. In total, 5568 pathogen proteins from 629 organisms interacted with 10,078 host proteins through 61,214 HP-PPIs. Of these, 49,249 reported HP-PPIs were of viral origin. In comparison, there were 10,080 reported bacterial HP-PPIs, while those from other organisms were even fewer. Further, 11,657 host first interactors having 103,120 interactions with the host proteins were retrieved as described in the methodology. Domains and motifs of these first interactors were identified and compared with those present in the pathogen proteins. The total number as well as the number of unique domains (each corresponding to a unique PSSM-ID) and motifs (each corresponding to a unique ScanProsite motif ID) annotated in pathogen proteins and host interactor proteins are shown in Table 2.
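The category shares implied by these counts can be worked out with a short calculation. The sketch assumes that the viral and bacterial counts are disjoint subsets of the 61,214 total, so the remainder is attributed to fungi and the "Others" category combined; the variable names are illustrative.

```python
# Counts reported in the text above.
total_hp_ppis = 61_214
viral_hp_ppis = 49_249
bacterial_hp_ppis = 10_080

# Assumption: categories are disjoint, so the remainder covers
# fungi and the "Others" category together.
remaining_hp_ppis = total_hp_ppis - viral_hp_ppis - bacterial_hp_ppis

viral_share = viral_hp_ppis / total_hp_ppis
bacterial_share = bacterial_hp_ppis / total_hp_ppis
remaining_share = remaining_hp_ppis / total_hp_ppis

print(remaining_hp_ppis)          # 1885
print(f"{viral_share:.1%}")       # 80.5%
print(f"{bacterial_share:.1%}")   # 16.5%
print(f"{remaining_share:.1%}")   # 3.1%
```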
MMPs are more numerous in comparison with DMPs
Out of the 5568 pathogen proteins from 629 pathogens, a total of 5254 proteins from 611 pathogens had domains or motifs similar to those of host interactor proteins (Table 2). The database contains 197,607 DMPs and 3,267,568 MMPs. The number of DMPs, MMPs, total HP-PPIs, and HP-PPIs characterized by mimicked domains and motifs for each pathogen category are listed in Table 3. Viruses showed the highest number of DMPs and MMPs, likely due to the preponderance of viral HP-PPIs in the data.
Interestingly, of the total 61,214 HP-PPIs reported, only 1551 were found to be characterized by domain mimicry whereas 49,265 were found to be characterized by mimicked motifs. The total number of HP-PPIs and the fractions of HP-PPIs characterized by mimicked domains and by mimicked motifs were compared across pathogen categories and are shown in Supplementary Figure S1. Motif mimicry dominates in number over domain mimicry across all pathogen categories. Previous studies have also reported the extensive use of motif mimicry by viral proteomes (Davey et al. 2011; Duro et al. 2015; Via et al. 2015; Garamszegi et al. 2013). However, due to the short and ambiguous nature of the motifs, some false positives are also expected. Therefore, the detected MMPs need to be carefully examined and validated in future work.
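To make the scale of the difference concrete, the fractions and the MMP-to-DMP ratio can be computed directly from the counts reported above. This is a plain arithmetic sketch and the variable names are illustrative.

```python
# Counts reported in the text above.
total_hp_ppis = 61_214
domain_mimicry_hp_ppis = 1_551
motif_mimicry_hp_ppis = 49_265
dmps = 197_607
mmps = 3_267_568

domain_fraction = domain_mimicry_hp_ppis / total_hp_ppis
motif_fraction = motif_mimicry_hp_ppis / total_hp_ppis
mmp_to_dmp_ratio = mmps / dmps

print(f"{domain_fraction:.1%}")    # 2.5%
print(f"{motif_fraction:.1%}")     # 80.5%
print(f"{mmp_to_dmp_ratio:.1f}")   # 16.5
```

About 2.5% of the reported HP-PPIs are characterized by domain mimicry against about 80.5% by motif mimicry, and MMPs outnumber DMPs roughly 16.5 to 1.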
Table 2 shows the number of pathogen proteins and host interactor proteins involved in domain mimicry along with the number of unique domains being mimicked or shared between the pathogen and host interactor proteins. As there were multiple instances of every mimicked domain, we looked for unique domains. There were 4300 unique domains shared by the pathogen and host first interactor proteins. The largest number of DMPs was found for the protein Serine Threonine Protein Kinase US3 (UniProt ID: P04413) from human herpesvirus 1 strain 17 (HHV-1). It forms 61,609 DMPs predominantly consisting of STKc_MST3_like domains (PSSM-ID: 270786). The top 10 pathogens involved in domain mimicry along with the number of DMPs are shown in Supplementary Table S1.
The top 10 most frequently mimicked domains are shown in Supplementary Figure S2(a). PHA03247 (large tegument protein UL36 domain family) was the most frequent among DMPs. UL36 is an important domain family that is crucial for virus-host interaction and host immune evasion (Newcomb and Brown 2010). UL36 is found to be colocalized with host/viral membrane proteins. It aids in the assembly and cell entry of Herpes Simplex Virus (Schipke et al. 2012). The top 10 most frequently occurring mimicked domains in different pathogen categories are shown in Supplementary Table S2. Of these, the DEAD-like helicase domain superfamily was frequent among viral DMPs and has been previously reported as an emerging class of host domains mimicked by viral pathogens (Meier-Stephenson et al. 2018).
In the case of bacteria, viruses and fungi, Rad50 ATPase and SbcC domains were found to be commonly mimicked domains. Both these domains are highly conserved among eukaryotes (humans and fungi), bacteria and viruses (Cromie et al. 2001; Yoshida et al. 2011) and have been implicated in disrupting the host DNA repair pathways (Gagnaire et al. 2017; Lilley et al. 2007). The pathogens with the highest number of DMPs and MMPs in the different pathogen categories, i.e., virus, bacteria, fungi, and others, are listed in Supplementary Tables S3, S4, S5 and S6, respectively.
Table 2 shows the number of pathogen proteins and host interactor proteins involved in motif mimicry along with the number of unique motifs being mimicked or shared between the pathogen and host interactor proteins. As there were multiple instances of every mimicked motif, we looked for unique motifs. There were only 96 unique motifs shared by the pathogen and host first interactor proteins. The largest number of MMPs was found for the Polymerase basic protein 2 from influenza A virus strain A/Wilson-Smith/1933 H1N1. It forms 35,385 MMPs predominantly consisting of protein kinase C phosphorylation site (PKC_PHOSPHO_SITE) motifs (ScanProsite Motif ID: PS00005). The top 10 pathogens by the count of MMPs are listed in Supplementary Table S7. It was observed that Saccharomyces cerevisiae S288c had the maximum count of DMPs and MMPs though its total number of reported HP-PPIs was very low in comparison with viruses or bacteria. The genes that regulate cellular processes in humans have equivalents that control cell division in yeasts, thus facilitating alteration of the host cellular machinery (Cazzanelli et al. 2018).
S. cerevisiae is an opportunistic pathogen as it is found to be associated with cutaneous infections, systemic bloodstream infections and infections of essential organs in immunocompromised or critically ill patients (Perez-Torrado and Querol 2015). Escherichia coli K12 is another opportunistic pathogen in our dataset as it is found to switch on its otherwise dormant pathogenic machinery to exert ill effects on the host under specific conditions such as a dysbiosed gut microbiome composition, a compromised immune system or a lack of gut microbe competition (Bhat et al. 2019). Thus, HP-PPIs, DMPs and MMPs from these organisms are of interest for understanding the role of mimicry proteins in the hijacking of human pathways and for the development of therapeutics against these pathogens.
The total count for the top 10 most frequently occurring motifs in the database is shown in Supplementary Figure S2(b), which indicates the predominance of phosphorylation sites for PKC and casein kinase II (CK2). The PKC and CK2 families of serine/threonine kinases play essential roles in the hijacking of multiple signalling pathways in humans, leading to many viral infections (Keating and Striker 2012). Sites for N-myristoylation, amidation, and N-glycosylation were amongst the most frequently mimicked motifs. The N-glycosylation site is a frequently occurring motif used by several pathogen proteins (especially viral glycoproteins) to evade the human immune system (Crispin and Doores 2015; Crispin et al. 2018). The envelope proteins of viruses like HIV-1 are heavily glycosylated and can provide camouflage against the human proteins, leading to alteration of immune recognition (Wagh et al. 2018; Seabright et al. 2019). N-myristoylation motifs, post-translational modification sites that have prominent roles in cellular signalling pathways, have been found to be mimicked by viral and bacterial proteins (Davey et al. 2011; Maurer-Stroh and Eisenhaber 2004). A comparative view of the top 10 most frequently mimicked motifs amongst the different pathogen categories is shown in Supplementary Table S8. Additionally, several other commonly mimicked motifs in our data are the ABC transporter family signature motif, Q motif, ATP/GTP-binding site motif A (P-loop), arginine-rich motif, ubiquitination site and prenyl group binding site. The counts of the top 20 mimicked motifs for the top 20 pathogens are shown in Supplementary Table S9.
Mimicry pairs in highly interacting pathogen proteins and host proteins
Several previous studies have shown that essentiality and pathogen fitness are correlated with a high number of interactions (Crua Asensio et al. 2017; Ahmed et al. 2018). Therefore, the number of DMPs and MMPs in the top 10 highly interacting pathogen proteins and host proteins were examined (Supplementary Tables S10 and S11, respectively). The top 10 highly interacting pathogen proteins were of viral origin and predominantly formed MMPs. Among host proteins, a few had a very high number of DMPs. It was observed that nuclear factor NF-kappa-B p105 subunit was a part of 491 DMPs, while Cellular tumor antigen p53 was a part of 180 DMPs.
Chemokine and cytokine, cholecystokinin receptor, epidermal growth factor receptor and platelet-derived growth factor signalling pathways are enriched in host proteins of mimicry pairs
The enriched pathways and processes of the host proteins involved in DMPs and MMPs were annotated. Apart from specific pathways for some neurodegenerative diseases such as Huntington and Parkinson disease, the chemokine and cytokine, cholecystokinin receptor, epidermal growth factor receptor and platelet-derived growth factor signalling pathways, as well as the T cell and B cell activation pathways, were enriched among the host proteins constituting the DMPs and MMPs. The enriched pathways along with their corrected p values are listed in Supplementary Table S12. Similarly, the enriched gene ontologies were determined for the host proteins. The enriched cellular compartments, molecular functions, and biological processes of the host proteins along with their corrected p values are shown in Supplementary Tables S13, S14 and S15, respectively.
Selected novel domain and motif mimicry candidates
Several novel candidate mimicry domains like SANT, TCP-1 and Tudor in pathogens were identified from analysis of the ImitateDB data. Some of the novel domain mimics identified in different pathogens along with their functions are shown in Supplementary Table S12. Microbodies C-terminal targeting signal, lipocalin signature and cornichon signature were novel motif mimic candidates. Selected novel mimicry motifs identified in different pathogens along with their functions are shown in Supplementary Table S13.
The ImitateDB web interface
The web interface for the ImitateDB database provides user-friendly access to the data and allows the user to search for information about DMPs and MMPs using multiple search options. The web interface has a home page that gives an overview of molecular mimicry and the procedure for determination of DMPs/MMPs. The interface provides a separate search page on which the user chooses between domain and motif to search for DMPs and MMPs, respectively. After this choice, the interface provides the next menu to choose among the different categories of pathogens, namely virus, bacteria, fungi and others. In each category, the database can be searched by organism, pathogen protein ID, host protein ID, interaction detection method, host interactor protein ID, matched domain PSSM ID, domain short name, matched motif ID, motif name, or pattern. For easier searching, selection of the category and subcategory populates a drop-down menu with the available options. Additionally, the user can enter a keyword to retrieve the required data. After this selection, the user needs to enter the correct captcha to fetch the results.
The website also has an instruction manual to help users query the database and an interpretation manual to help users interpret the results. An expanded view of the search panel in the database is shown in Fig. 4a. The results are displayed in the form of a table that can be downloaded. The results are linked to external resources through the hyperlinked domain PSSM ID, ScanProsite motif ID, protein ID and PubMed ID. The download feature for bulk files has been restricted due to download constraints. For queries yielding between 10,000 and 100,000 records, the user is provided with the results by email using an in-built mailer (as shown in Fig. 4b) that pops up after clicking the download button. Users are advised to check the spam folder for mails from ImitateDB. For queries yielding more than 100,000 records, the user can contact the ImitateDB team to obtain the results.
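The category-then-field search described above can be sketched as a simple filter over records. The record layout, the field names and the `search` function are hypothetical and do not reflect the actual ImitateDB schema or back end; the UniProt IDs come from examples in this paper, while the feature labels are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MimicryRecord:
    # Hypothetical record layout, for illustration only.
    category: str
    organism: str
    pathogen_protein: str
    host_protein: str
    host_interactor: str
    feature_id: str


def search(records, category, field, keyword):
    """Keep records of one pathogen category whose field contains the keyword."""
    keyword = keyword.lower()
    return [
        r for r in records
        if r.category == category and keyword in getattr(r, field).lower()
    ]


records = [
    MimicryRecord("virus", "Vaccinia virus", "P20639", "P19525", "P05198", "S1"),
    MimicryRecord("virus", "Epstein Barr virus", "P0C6Z1", "Q16611", "Q07817", "BH2"),
    MimicryRecord("bacteria", "Legionella pneumophila", "Q5ZUS4", "Q13185", "O43463", "SET"),
]

hits = search(records, category="virus", field="organism", keyword="epstein")
print([r.pathogen_protein for r in hits])
```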
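The size-dependent delivery rule described above can be summarized in a few lines. This is a sketch rather than the site's code: the function name `delivery_mode` and the returned labels are hypothetical, and treating exactly 10,000 and exactly 100,000 records as email cases is an assumption, since the text does not state how the boundaries are handled.

```python
def delivery_mode(n_records: int) -> str:
    """Map the size of a result set to the way it is delivered."""
    if n_records < 0:
        raise ValueError("n_records must be non-negative")
    if n_records < 10_000:
        return "download"       # table downloaded directly from the results page
    if n_records <= 100_000:    # assumption: both boundaries count as email
        return "email"          # results sent through the in-built mailer
    return "contact_team"       # results requested from the ImitateDB team


for n in (500, 10_000, 100_000, 250_000):
    print(n, delivery_mode(n))
```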
Literature validation of mimicry using selected examples
The following selected examples from ImitateDB could be validated from the literature as instances of domains/motifs that are mimicked by the pathogen to bind to the host and also serve as the site of interaction:
DMP: ImitateDB determined a DMP consisting of protein K3L (UniProt ID: P20639) of Vaccinia virus, which mimics the S1 domain of human eIF2α (UniProt ID: P05198) to interact with the human PKR (UniProt ID: P19525). Competitive binding experiments suggested that PKR recognizes and interacts with K3L and eIF2α by a common mode due to homology between the S1 domains of K3L and eIF2α (Sharp et al. 1997; Dar and Sicheri 2002; Beattie et al. 1991).
MMP: ImitateDB determined an MMP consisting of protein BHRF1 (UniProt ID: P0C6Z1) of Epstein Barr virus (EBV), which mimics the BH2 motif of human Bcl-xl (UniProt ID: Q07817) to interact with the human BAK1 (UniProt ID: Q16611). The BHRF1–Bak complex (PDB ID: 2xpx) and the Bcl-xl–Bak complex (PDB ID: 1bxl) show the competitive mode of binding of BHRF1 and Bcl-xl through the BH2 motif on the BH3 peptide of BAK (Kvansakul and Hinds 2013).
Other modes of mimicry-mediated binding
There can be instances where the mimicked domain/motif between the interacting host and pathogen protein is not the actual site of interaction. As an example, ImitateDB contains LegAS4 protein (UniProt ID: Q5ZUS4) of L. pneumophila that competes with human histone methyltransferase/H3K9 (UniProt ID: O43463) to bind to human HP1γ (UniProt ID: Q13185), a possible transcriptional repressor of heterochromatin-like complexes. The SET domain of H3K9 is structurally mimicked. However, the binding at HP1γ is not through the SET domain. SET domain mimicry is used by LegAS4 to target human heterochromatin-1 to activate host rDNA transcription as proved through chromatin immunoprecipitation assays (Li et al. 2013). Therefore, in certain cases, the mimicry in the DMP/MMP may be incidental.