On the Statistics of Identifying Candidate Pathogen Effectors

  • Leighton Pritchard
  • David Broadhurst
Part of the Methods in Molecular Biology book series (MIMB, volume 1127)


High-throughput sequencing is an increasingly accessible tool for cataloging the gene complements of plant pathogens and their hosts. It has had great impact in plant pathology, enabling rapid acquisition of data for a wide range of pathogens and hosts, and leading to the selection of novel candidate effector proteins and/or associated host targets (Bart et al., Proc Natl Acad Sci U S A doi:10.1073/pnas.1208003109, 2012; Agbor and McCormick, Cell Microbiol 13:1858–1869, 2011; Fabro et al., PLoS Pathog 7:e1002348, 2011; Kim et al., Mol Plant Pathol 12:715–730, 2011; Kimbrel et al., Mol Plant Pathol 12:580–594, 2011; O'Brien et al., Curr Opin Microbiol 14:24–30, 2011; Vleeshouwers et al., Annu Rev Phytopathol 49:507–531, 2011; Sarris et al., Mol Plant Pathol 11:795–804, 2010; Boch and Bonas, Annu Rev Phytopathol 48:419–436, 2010; McDermott et al., Infect Immun 79:23–32, 2011).

Identification of candidate effectors from genome data is, in principle, no different from classification in any other high-content or high-throughput experiment. The primary aim is to discover a set of qualitative or quantitative sequence characteristics that discriminate, with a defined level of certainty, between proteins that have previously been identified as either "effector" (positive) or "not effector" (negative). Combining these characteristics in a mathematical model, or classifier, enables prediction of whether a protein is or is not an effector, again with a defined level of certainty. High-throughput screening of the gene complement is then performed to identify candidate effectors. This may seem straightforward, but it is unfortunately very easy to identify seemingly persuasive candidate effectors that are, in fact, entirely spurious.
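One way to see how spurious candidates arise is to work through the arithmetic of screening a whole proteome in which true effectors are rare. The sketch below is illustrative only: the function name `screen_outcome`, the proteome size, the number of true effectors, and the sensitivity and specificity values are all assumptions chosen for the example, not figures from the chapter.

```python
def screen_outcome(n_proteins, n_effectors, sensitivity, specificity):
    """Expected result of screening a proteome with a binary classifier.

    All arguments are hypothetical inputs for illustration:
      n_proteins  - total proteins screened
      n_effectors - how many of them are truly effectors
      sensitivity - fraction of true effectors the classifier calls positive
      specificity - fraction of non-effectors the classifier calls negative
    """
    n_negatives = n_proteins - n_effectors
    true_pos = sensitivity * n_effectors
    false_pos = (1.0 - specificity) * n_negatives
    # Positive predictive value: the fraction of reported candidates
    # that are genuinely effectors.
    ppv = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, ppv


# Assumed scenario: 20,000 proteins, 200 true effectors, and a classifier
# that is 95% sensitive and 95% specific.
tp, fp, ppv = screen_outcome(20000, 200, 0.95, 0.95)
print(f"expected true positives:   {tp:.0f}")
print(f"expected false positives:  {fp:.0f}")
print(f"positive predictive value: {ppv:.2f}")
```

Under these assumed numbers the screen reports about 190 true effectors alongside about 990 false positives, so only around 16% of the candidates are real, even though the classifier looks accurate on a balanced test set.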

The main sources of danger in this area of statistical modeling are not entirely independent of each other, and include: inappropriate choice of classifier model; poor selection of reference sequences (known positive and negative examples); poor definition of classes (what is, and what is not, an effector); inadequate training sample size; poor model validation; and lack of adequate model performance metrics (Xia et al., Metabolomics doi:10.1007/s11306-012-0482-9, 2012). Many studies fail to take these issues into account, and thereby fail to discover anything of true significance or, worse, report spurious findings that are impossible to validate. Here we summarize the impact of these issues and present strategies to assist in improving design and evaluation of effector classifiers, enabling robust scientific conclusions to be drawn from the available data.
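Two of the issues listed above, model validation and performance metrics, can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the chapter's method: it uses a deliberately simple one-feature threshold classifier as a stand-in for any model, fits it only on training folds in a stratified k-fold cross-validation, and reports sensitivity and specificity on held-out data. All function names and the synthetic scores are hypothetical.

```python
import random


def confusion(labels, predictions):
    """Count true/false positives and negatives for boolean labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    tn = sum(1 for y, p in pairs if not y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    return tp, tn, fp, fn


def fit_threshold(scores, labels):
    """Toy model: cut-off midway between the two class means."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def stratified_folds(labels, k, rng):
    """Assign each example to one of k folds, balancing classes across folds."""
    folds = [0] * len(labels)
    for cls in (True, False):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            folds[i] = position % k
    return folds


def cross_validate(scores, labels, k=5, seed=0):
    """Pool held-out predictions over k folds; the model never sees its test fold."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, k, rng)
    tp = tn = fp = fn = 0
    for fold in range(k):
        train = [i for i in range(len(labels)) if folds[i] != fold]
        test = [i for i in range(len(labels)) if folds[i] == fold]
        cut = fit_threshold([scores[i] for i in train],
                            [labels[i] for i in train])
        preds = [scores[i] > cut for i in test]
        a, b, c, d = confusion([labels[i] for i in test], preds)
        tp, tn, fp, fn = tp + a, tn + b, fp + c, fn + d
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp)}


# Synthetic, assumed data: 30 "effectors" and 70 "non-effectors" whose
# scores overlap, so the held-out estimates are imperfect.
data_rng = random.Random(1)
labels = [True] * 30 + [False] * 70
scores = ([data_rng.gauss(1.0, 1.0) for _ in range(30)]
          + [data_rng.gauss(0.0, 1.0) for _ in range(70)])
print(cross_validate(scores, labels))
```

The design point is that the threshold is re-estimated inside every fold. Fitting it once on all of the data and then "validating" on the same proteins would give optimistic figures that do not reflect performance on unseen sequences.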

Key words

Effectors; Statistical modeling; Classification; Bioinformatics; Sequence analysis; Genomics; High-throughput screening


References

  1. Bart R, Cohn M, Kassen A, McCallum EJ, Shybut M et al (2012) High-throughput genomic sequencing of cassava bacterial blight strains identifies conserved effectors to target for durable resistance. Proc Natl Acad Sci U S A. doi: 10.1073/pnas.1208003109
  2. Agbor TA, McCormick BA (2011) Salmonella effectors: important players modulating host cell function during infection. Cell Microbiol 13:1858–1869. doi: 10.1111/j.1462-5822.2011.01701.x
  3. Fabro G, Steinbrenner J, Coates M, Ishaque N, Baxter L et al (2011) Multiple candidate effectors from the oomycete pathogen Hyaloperonospora arabidopsidis suppress host plant immunity. PLoS Pathog 7:e1002348. doi: 10.1371/journal.ppat.1002348
  4. Kim J-G, Taylor KW, Mudgett MB (2011) Comparative analysis of the XopD type III secretion (T3S) effector family in plant pathogenic bacteria. Mol Plant Pathol 12:715–730. doi: 10.1111/j.1364-3703.2011.00706.x
  5. Kimbrel JA, Givan SA, Temple TN, Johnson KB, Chang JH (2011) Genome sequencing and comparative analysis of the carrot bacterial blight pathogen, Xanthomonas hortorum pv. carotae M081, for insights into pathogenicity and applications in molecular diagnostics. Mol Plant Pathol 12:580–594. doi: 10.1111/j.1364-3703.2010.00694.x
  6. O'Brien HE, Desveaux D, Guttman DS (2011) Next-generation genomics of Pseudomonas syringae. Curr Opin Microbiol 14:24–30. doi: 10.1016/j.mib.2010.12.007
  7. Vleeshouwers VGAA, Raffaele S, Vossen JH, Champouret N, Oliva R et al (2011) Understanding and exploiting late blight resistance in the age of effectors. Annu Rev Phytopathol 49:507–531. doi: 10.1146/annurev-phyto-072910-095326
  8. Sarris PF, Skandalis N, Kokkinidis M, Panopoulos NJ (2010) In silico analysis reveals multiple putative type VI secretion systems and effector proteins in Pseudomonas syringae pathovars. Mol Plant Pathol 11:795–804. doi: 10.1111/j.1364-3703.2010.00644.x
  9. Boch J, Bonas U (2010) Xanthomonas AvrBs3 family-type III effectors: discovery and function. Annu Rev Phytopathol 48:419–436. doi: 10.1146/annurev-phyto-080508-081936
  10. McDermott JE, Corrigan A, Peterson E, Oehmen C, Niemann G et al (2011) Computational prediction of type III and IV secreted effectors in gram-negative bacteria. Infect Immun 79:23–32. doi: 10.1128/IAI.00537-10
  11. Xia J, Broadhurst DI, Wilson M, Wishart DS (2012) Translational biomarker discovery in clinical metabolomics: an introductory tutorial. Metabolomics. doi: 10.1007/s11306-012-0482-9
  12. Cornelis GR (2006) The type III secretion injectisome. Nat Rev Microbiol 4:811–825. doi: 10.1038/nrmicro1526
  13. Whisson SC, Boevink PC, Moleleki L, Avrova AO, Morales JG et al (2007) A translocation signal for delivery of oomycete effector proteins into host plant cells. Nature 450:115–118. doi: 10.1038/nature06203
  14. Löwer M, Schneider G (2009) Prediction of type III secretion signals in genomes of gram-negative bacteria. PLoS ONE 4:e5917. doi: 10.1371/journal.pone.0005917
  15. Arnold R, Brandmaier S, Kleine F, Tischler P, Heinz E et al (2009) Sequence-based prediction of type III secreted proteins. PLoS Pathog 5:e1000376. doi: 10.1371/journal.ppat.1000376
  16. Sui T, Yang Y, Wang X (2013) Sequence-based feature extraction for type III effector prediction. Int J Biosci Biochem Bioinforma 3:246–251. doi: 10.7763/IJBBB.2013.V3.206
  17. Liu C, Che D, Liu X, Song Y (2013) Applications of machine learning in genomics and systems biology. Comput Math Methods Med 2013:587492. doi: 10.1155/2013/587492
  18. Broadhurst D, Kell DB (2006) Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics 2:171–196
  19. O'Brien HE, Thakur S, Gong Y, Fung P, Zhang J et al (2012) Extensive remodeling of the Pseudomonas syringae pv. avellanae type III secretome associated with two independent host shifts onto hazelnut. BMC Microbiol 12:141
  20. McNally RR, Toth IK, Cock PJA, Pritchard L, Hedley PE et al (2012) Genetic characterization of the HrpL regulon of the fire blight pathogen Erwinia amylovora reveals novel virulence factors. Mol Plant Pathol 13:160–173. doi: 10.1111/j.1364-3703.2011.00738.x
  21. Arnold DL, Jackson RW (2011) Bacterial genomes: evolution of pathogenicity. Curr Opin Plant Biol 14:385–391. doi: 10.1016/j.pbi.2011.03.001
  22. Haas BJ, Kamoun S, Zody MC, Jiang RHY, Handsaker RE et al (2009) Genome sequence and analysis of the Irish potato famine pathogen Phytophthora infestans. Nature 461:393–398. doi: 10.1038/nature08358
  23. Win J, Morgan W, Bos JIB, Krasileva KV, Cano LM et al (2007) Adaptive evolution has targeted the C-terminal domain of the RXLR effectors of plant pathogenic oomycetes. Plant Cell 19:2349–2369. doi: 10.1105/tpc.107.051037
  24. Bhattacharjee S, Hiller NL, Liolios K, Win J, Kanneganti T-D et al (2006) The malarial host-targeting signal is conserved in the Irish potato famine pathogen. PLoS Pathog 2:e50. doi: 10.1371/journal.ppat.0020050
  25. Petnicki-Ocwieja T, Schneider DJ, Tam VC, Chancey ST, Shan L et al (2002) Genomewide identification of proteins secreted by the Hrp type III protein secretion system of Pseudomonas syringae pv. tomato DC3000. Proc Natl Acad Sci U S A 99:7652–7657. doi: 10.1073/pnas.112183899
  26. Greenberg JT, Vinatzer B (2003) Identifying type III effectors of plant pathogens and analyzing their interaction with plant cells. Curr Opin Microbiol 6(1):20–28
  27. Bogdanove AJ, Schornack S, Lahaye T (2010) TAL effectors: finding plant genes for disease and defense. Curr Opin Plant Biol 13:394–401. doi: 10.1016/j.pbi.2010.04.010
  28. Boch J, Scholze H, Schornack S, Landgraf A, Hahn S et al (2009) Breaking the code of DNA-binding specificity of TAL-type III effectors. Science. doi: 10.1126/science.1178811
  29. Yang Y (2012) Identification of novel type III effectors using latent Dirichlet allocation. Comput Math Methods Med 2012:696190. doi: 10.1155/2012/696190
  30. Wang Y, Zhang Q, Sun M-A, Guo D (2011) High-accuracy prediction of bacterial type III secreted effectors based on position-specific amino acid composition profiles. Bioinformatics 27:777–784. doi: 10.1093/bioinformatics/btr021
  31. Macho AP, Ruiz-Albert J, Tornero P, Beuzón CR (2009) Identification of new type III effectors and analysis of the plant response by competitive index. Mol Plant Pathol 10:69–80. doi: 10.1111/j.1364-3703.2008.00511.x
  32. Xu S, Zhang C, Miao Y, Gao J, Xu D (2010) Effector prediction in host-pathogen interaction based on a Markov model of a ubiquitous EPIYA motif. BMC Genomics 11(Suppl 3):S1. doi: 10.1186/1471-2164-11-S3-S1
  33. Jehl M-A, Arnold R, Rattei T (2010) Effective – a database of predicted secreted bacterial proteins. Nucleic Acids Res. doi: 10.1093/nar/gkq1154
  34. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
  35. Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23:2507–2517. doi: 10.1093/bioinformatics/btm344
  36. Eriksson L, Johansson E, Kettaneh-Wold N, Wold S (2001) Multi- and megavariate data analysis: principles and applications. Umetrics AB, Umeå
  37. Brereton RG (2003) Chemometrics: data analysis for the laboratory and chemical plant. Wiley, Chichester, UK
  38. Efron B, Tibshirani R (1997) Improvements on cross-validation: the .632+ bootstrap method. J Am Stat Assoc 92:548–560. doi: 10.1080/01621459.1997.10474007
  39. Obuchowski NA, Lieber ML, Wians FH (2004) ROC curves in clinical chemistry: uses, misuses, and possible solutions. Clin Chem 50:1118–1125. doi: 10.1373/clinchem.2004.031823
  40. Zweig MH, Campbell G (1993) Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem 39(4):561–577
  41. Lasko TA, Bhagwat JG, Zou KH (2005) The use of receiver operating characteristic curves in biomedical informatics. J Biomed Inform 38(5):404–415

Copyright information

© Springer Science+Business Media, New York 2014

Authors and Affiliations

  1. Information and Computational Sciences, The James Hutton Institute, Invergowrie, UK
  2. Department of Medicine, Katz Group Centre for Pharmacy & Health, University of Alberta, Edmonton, Canada
