Abstract
Biological information, such as gene expression data from microarrays, can be extracted from publicly available genomic data. To narrow down the vast range of possible wet lab experiments, global high-throughput data and existing knowledge should first be used to infer biological knowledge and generate biological hypotheses. Here, based on microarray data, we propose the use of clustering and classification methods that have become very popular and are implemented in freely available software to predict whether proteins encoded by genes of the pathogen Streptococcus pyogenes participate in virulence mechanisms. Confidence in the predictions is based on the classification errors for known genes and on repeated prediction by more than three methods. Special emphasis is placed on the nonlinear kernel classification methods used. We propose a list of candidates that could be virulence factors or that participate in the virulence process of S. pyogenes. Biological validation should start from this list of candidates, as they behave similarly to known virulence factors.
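The "repeated prediction by more than three methods" criterion can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the method labels, and the gene identifiers below are all hypothetical, and the sketch only assumes that each method yields a set of genes it predicts to be virulence-related.

```python
from collections import Counter


def consensus_candidates(predictions, min_methods=4):
    """Return genes predicted by at least `min_methods` methods.

    `predictions` maps a method label to the set of gene identifiers that
    the method predicts to be virulence-related. The default of 4 encodes
    "more than three methods"; this threshold reading is an assumption.
    """
    votes = Counter()
    for genes in predictions.values():
        # set() guards against a method listing the same gene twice.
        votes.update(set(genes))
    return {gene: n for gene, n in votes.items() if n >= min_methods}


# Hypothetical output of five clustering/classification methods.
example_predictions = {
    "method_1": {"gene_a", "gene_b"},
    "method_2": {"gene_a", "gene_c"},
    "method_3": {"gene_a", "gene_b"},
    "method_4": {"gene_a"},
    "method_5": {"gene_b", "gene_c"},
}

candidates = consensus_candidates(example_predictions)
```

In this toy input only `gene_a` is retained, because it is the only gene predicted by four of the five hypothetical methods. In practice the vote count would be read alongside each method's classification error on known genes.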
Notes
For other interpretations of the values α_i, please refer to Schebesch and Stecking (2005).
References
Bisno A, Brito M, Collins CM (2003) Molecular basis of group A streptococcal virulence. Lancet Infect Dis 3:191–200
Bleakley K, Biau G, Vert JP (2007) Supervised reconstruction of biological networks with local models. Bioinformatics 23:i57–i65
Clarke B, Fokoué E, Zhang H (2009) Principles and theory for data mining and machine learning. Springer, New York
Cox KH, Ruiz-Bustos E, Courtney HS, Dale JB, Pence MA, Nizet V, Aziz RK, Gerling I, Price SM, Hasty DL (2009) Inactivation of DltA modulates virulence factor expression in Streptococcus pyogenes. PLoS One 4(4):e5366. doi:10.1371/journal.pone.0005366
Friedman JH (1989) Regularized discriminant analysis. JASA 84:165–175
Hartigan JA, Wong MA (1979) A k-means clustering algorithm. Appl Stat 28:100–108
Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York
Kohonen T (2000) Self-organizing maps, 3rd edn. Springer-Verlag, Berlin
Leiva-Valdebenito S, Torres-Avilés F (2010) A review of the most common partition algorithms in cluster analysis: a comparative study. Rev Colomb Estad 33(2):321–339
López-Kleine L, Monnet V, Pechoux C, Trubuil A (2008) Role of bacterial peptidase F inferred by statistical analysis and further experimental validation. HFSP J 2:29–41
López-Kleine L, Ospina L, Molano N (2012) Using multivariate methods to infer knowledge from genomic data. Int J Bioinform Res Appl (in press)
Qi Y, Klein-Seetharaman J, Bar-Joseph Z (2005) Random forest similarity for protein–protein interaction prediction from multiple sources. Pac Symp Biocomput 10:531–542
R Development Core Team (2011) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/
Sagar V, Kumar R, Ganguly NK, Chakraborti A (2008) Comparative analysis of emm type pattern of group A Streptococcus throat and skin isolates from India and their association with closely related SIC, a streptococcal virulence factor. BMC Microbiol 16(8):150
Schebesch B, Stecking R (2005) Support vector machines for classifying and describing credit applicants: detecting typical and critical regions. J Oper Res Soc 56(9):1082–1088
Schölkopf B, Smola A (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond. The MIT Press, Cambridge
Shelburne SA, Keith D, Horstmann N, Sumby P, Davenport MT, Graviss EA, Brennan RG, Musser JM (2008) A direct link between carbohydrate utilization and virulence in the major human pathogen group A Streptococcus. PNAS 105(5):1698–1703
Virtaneva K, Porcella SF, Graham MR, Ireland RM, Johnson CA, Ricklefs SM, Babar I, Parkins LD, Romero RA, Corn GJ, Gardner DJ, Bailey JR, Parnell MJ, Musser JM (2005) Longitudinal analysis of the group A Streptococcus transcriptome in experimental pharyngitis in cynomolgus macaques. PNAS 102:9014–9019
Werhli AV, Husmeier D (2007) Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge. Stat Appl Genet Mol Biol 6:15
Yamanishi Y, Vert JP, Nakaya A, Kanehisa M (2003) Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis. Bioinformatics 19:i323–i330
Acknowledgments
This work was partially financed by the Fundación para el avance de la ciencia of the Colombian Banco de la República.
Cite this article
López-Kleine, L., Torres-Avilés, F., Tejedor, F.H. et al. Virulence factor prediction in Streptococcus pyogenes using classification and clustering based on microarray data. Appl Microbiol Biotechnol 93, 2091–2098 (2012). https://doi.org/10.1007/s00253-012-3917-3
DOI: https://doi.org/10.1007/s00253-012-3917-3