On the Statistics of Identifying Candidate Pathogen Effectors
High-throughput sequencing is an increasingly accessible tool for cataloging the gene complements of plant pathogens and their hosts. It has had a great impact on plant pathology, enabling rapid acquisition of data for a wide range of pathogens and hosts and leading to the selection of novel candidate effector proteins and/or their associated host targets (Bart et al., Proc Natl Acad Sci U S A doi:10.1073/pnas.1208003109, 2012; Agbor and McCormick, Cell Microbiol 13:1858–1869, 2011; Fabro et al., PLoS Pathog 7:e1002348, 2011; Kim et al., Mol Plant Pathol 2:715–730, 2011; Kimbrel et al., Mol Plant Pathol 12:580–594, 2011; O’Brien et al., Curr Opin Microbiol 14:24–30, 2011; Vleeshouwers et al., Annu Rev Phytopathol 49:507–531, 2011; Sarris et al., Mol Plant Pathol 11:795–804, 2010; Boch and Bonas, Annu Rev Phytopathol 48:419–436, 2010; McDermott et al., Infect Immun 79:23–32, 2011).
Identification of candidate effectors from genome data is no different from classification in any other high-content or high-throughput experiment. The primary aim is to discover a set of qualitative or quantitative sequence characteristics that discriminate, with a defined level of certainty, between proteins that have previously been identified as either “effector” (positive) or “not effector” (negative). Combining these characteristics in a mathematical model, or classifier, enables prediction of whether a protein is or is not an effector, again with a defined level of certainty. High-throughput screening of the gene complement is then performed to identify candidate effectors. This may seem straightforward, but it is unfortunately very easy to identify seemingly persuasive candidate effectors that are, in fact, entirely spurious.
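How easily spurious candidates arise can be illustrated with a short calculation. The sketch below is not taken from any cited study: the function name and all numbers (a gene complement of 10,000 proteins, 100 of them true effectors, a classifier with 90% sensitivity and 95% specificity) are hypothetical values chosen only to show how class imbalance affects the proportion of called candidates that are genuine.

```python
def expected_screen_outcome(n_effectors, n_non_effectors, sensitivity, specificity):
    """Expected result of screening a gene complement with a binary classifier.

    sensitivity: fraction of true effectors that the classifier calls "effector".
    specificity: fraction of non-effectors that the classifier calls "not effector".
    """
    true_positives = sensitivity * n_effectors
    false_positives = (1.0 - specificity) * n_non_effectors
    called = true_positives + false_positives
    # Positive predictive value: fraction of called candidates that are real.
    ppv = true_positives / called if called > 0 else float("nan")
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "ppv": ppv,
    }


# Hypothetical screen: 100 effectors hidden among 9,900 non-effectors.
outcome = expected_screen_outcome(
    n_effectors=100, n_non_effectors=9900, sensitivity=0.90, specificity=0.95
)
print(f"expected true positives:  {outcome['true_positives']:.0f}")
print(f"expected false positives: {outcome['false_positives']:.0f}")
print(f"positive predictive value: {outcome['ppv']:.3f}")
```

Under these assumed values the screen is expected to return about 90 true effectors alongside about 495 false positives, so only roughly 15% of the candidates would be genuine, even though the classifier looks accurate on paper.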
The main pitfalls in this area of statistical modeling are not entirely independent of one another. They include: inappropriate choice of classifier model; poor selection of reference sequences (known positive and negative examples); poor definition of classes (what is, and what is not, an effector); inadequate training sample size; poor model validation; and lack of adequate model performance metrics (Xia et al., Metabolomics doi:10.1007/s11306-012-0482-9, 2012). Many studies fail to take these issues into account, and thereby fail to discover anything of true significance or, worse, report spurious findings that are impossible to validate. Here we summarize the impact of these issues and present strategies to assist in improving the design and evaluation of effector classifiers, enabling robust scientific conclusions to be drawn from the available data.
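The point about performance metrics can be made concrete with a minimal sketch. The code below assumes labels coded as 1 (“effector”) and 0 (“not effector”) and uses an invented, deliberately useless classifier that calls every protein “not effector”; the function names and the data are illustrative assumptions, not part of any published pipeline. It reports accuracy alongside sensitivity, specificity, and the Matthews correlation coefficient (MCC), here defined as 0 when its denominator is zero.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for labels coded 1 (effector) / 0 (not effector)."""
    tp = fp = tn = fn = 0
    for truth, call in zip(y_true, y_pred):
        if truth == 1 and call == 1:
            tp += 1
        elif truth == 0 and call == 1:
            fp += 1
        elif truth == 0 and call == 0:
            tn += 1
        else:  # truth == 1 and call == 0
            fn += 1
    return tp, fp, tn, fn


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def performance_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": safe_ratio(tp, tp + fn),
        "specificity": safe_ratio(tn, tn + fp),
        "mcc": safe_ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical held-out set: 10 effectors among 1,000 proteins.
y_true = [1] * 10 + [0] * 990
# A trivial "classifier" that never predicts an effector.
y_pred = [0] * 1000

metrics = performance_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

On this invented data set the trivial classifier reaches 99% accuracy while recovering no effectors at all (sensitivity and MCC are both 0), which is why accuracy alone is an inadequate summary when positive examples are rare.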
Key words: Effectors, Statistical modeling, Classification, Bioinformatics, Sequence analysis, Genomics, High-throughput screening
- 3. Fabro G, Steinbrenner J, Coates M, Ishaque N, Baxter L et al (2011) Multiple candidate effectors from the oomycete pathogen Hyaloperonospora arabidopsidis suppress host plant immunity. PLoS Pathog 7:e1002348. doi: 10.1371/journal.ppat.1002348
- 5. Kimbrel JA, Givan SA, Temple TN, Johnson KB, Chang JH (2011) Genome sequencing and comparative analysis of the carrot bacterial blight pathogen, Xanthomonas hortorum pv. carotae M081, for insights into pathogenicity and applications in molecular diagnostics. Mol Plant Pathol 12:580–594. doi: 10.1111/j.1364-3703.2010.00694.x
- 17. Liu C, Che D, Liu X, Song Y (2013) Applications of machine learning in genomics and systems biology. Comput Math Methods Med 2013:587492. doi: 10.1155/2013/587492
- 19. O'Brien HE, Thakur S, Gong Y, Fung P, Zhang J et al (2012) Extensive remodeling of the Pseudomonas syringae pv. avellanae type III secretome associated with two independent host shifts onto hazelnut. BMC Microbiol 12:141
- 20. McNally RR, Toth IK, Cock PJA, Pritchard L, Hedley PE et al (2012) Genetic characterization of the HrpL regulon of the fire blight pathogen Erwinia amylovora reveals novel virulence factors. Mol Plant Pathol 13:160–173. doi: 10.1111/j.1364-3703.2011.00738.x
- 25. Petnicki-Ocwieja T, Schneider DJ, Tam VC, Chancey ST, Shan L et al (2002) Genomewide identification of proteins secreted by the Hrp type III protein secretion system of Pseudomonas syringae pv. tomato DC3000. Proc Natl Acad Sci U S A 99:7652–7657. doi: 10.1073/pnas.112183899
- 34. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
- 36. Eriksson L, Johansson E, Kettaneh-Wold N, Wold S (2001) Multi- and megavariate data analysis: principles and applications. Umetrics AB, Umeå