Abstract
Standard molecular experimental methods and mathematical procedures often fail to resolve many questions in phylogeny and classification. Modern artificial intelligence-based techniques, such as radial basis functions, genetic algorithms, artificial neural networks, and support vector machines, have considerable potential in this regard. Relying on a large number of essential parameters, rather than a single molecular parameter, can improve robustness, reliability, and accuracy. This study used a dataset of computed protein physicochemical properties from 20 different bacterial genera. A total of 57 sequential and structural parameters derived from protein sequences were considered for the initial classification. Feature selection techniques were employed to identify the most important features in the dataset. Several amino acids, hydrophobicity, relative sulfur percentage, and codon number were selected as important parameters. Comparative analyses were performed using the RapidMiner data mining platform. The support vector machine proved to be the best method, with a maximum accuracy of more than 91%.
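To make the idea of sequence-derived features concrete, the following is a minimal sketch of how a few such descriptors might be computed from a single protein sequence. It is an illustration only, not the authors' pipeline: the function name `sequence_features` is hypothetical, the full set of 57 parameters is not listed in this abstract, and "relative sulfur percentage" is assumed here to mean the percentage of sulfur-containing residues (cysteine and methionine), which may differ from the definition used in the study. Hydrophobicity is computed as the mean Kyte-Doolittle hydropathy (GRAVY).

```python
from collections import Counter

# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(seq: str) -> dict:
    """Compute a small, illustrative feature vector for one protein sequence.

    Hypothetical helper: the feature names and the sulfur definition are
    assumptions made for this sketch, not taken from the study.
    """
    # Keep only standard amino acids; ambiguous codes such as X are skipped.
    residues = [aa for aa in seq.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")

    n = len(residues)
    counts = Counter(residues)

    # Percentage composition of each of the 20 standard amino acids.
    features = {
        f"pct_{aa}": 100.0 * counts[aa] / n for aa in sorted(KYTE_DOOLITTLE)
    }
    # GRAVY: mean hydropathy over all residues.
    features["gravy"] = sum(KYTE_DOOLITTLE[aa] for aa in residues) / n
    # Assumed definition: share of residues that contain sulfur (Cys, Met).
    features["pct_sulfur_residues"] = 100.0 * (counts["C"] + counts["M"]) / n
    features["length"] = float(n)
    return features


if __name__ == "__main__":
    example = sequence_features("MCAA")
    print(example["gravy"], example["pct_sulfur_residues"])
```

Feature vectors of this kind, one per protein and labeled by genus, would form the rows of a table that a feature selection step and a classifier such as a support vector machine could then operate on.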
Cite this article
Banerjee, A.K., Ravi, V., Murty, U.S.N. et al. Application of Intelligent Techniques for Classification of Bacteria Using Protein Sequence-Derived Features. Appl Biochem Biotechnol 170, 1263–1281 (2013). https://doi.org/10.1007/s12010-013-0268-1
DOI: https://doi.org/10.1007/s12010-013-0268-1