Annals of Operations Research, Volume 174, Issue 1, pp 219–235

Disparate data fusion for protein phosphorylation prediction

  • Genetha A. Gray
  • Pamela J. Williams
  • W. Michael Brown
  • Jean-Loup Faulon
  • Kenneth L. Sale


New challenges in knowledge extraction include interpreting and classifying data sets while simultaneously considering related information to confirm results or identify false positives. We discuss a data fusion algorithmic framework targeted at this problem. It includes separate base classifiers for each data type and a fusion method for combining the individual classifiers. The fusion method is an extension of current ensemble classification techniques and has the advantage of allowing data to remain in heterogeneous databases. In this paper, we focus on the applicability of such a framework to the protein phosphorylation prediction problem.
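
As a rough illustration of the framework outlined above, the following sketch shows one way the outputs of per-data-type base classifiers might be combined. It uses a plain weighted vote as a stand-in for the fusion step; the abstract does not specify the fusion method, so this is not the authors' algorithm. The function name `fuse_predictions`, the source names `sequence` and `structure`, the toy decision rules, and the weights are all assumptions made for illustration.

```python
from typing import Callable, Mapping

# Hypothetical type: a base classifier maps one record of its own data type
# to a label in {0, 1} (1 = predicted phosphorylation site).
BaseClassifier = Callable[[object], int]


def fuse_predictions(
    classifiers: Mapping[str, BaseClassifier],
    records: Mapping[str, object],
    weights: Mapping[str, float],
) -> int:
    """Weighted vote over base classifiers, one per data type.

    Each classifier only sees the record from its own source, so the
    underlying data can stay in separate, heterogeneous databases.
    """
    score = 0.0
    total = 0.0
    for source, classify in classifiers.items():
        if source not in records:
            continue  # this source has no entry for the sample
        score += weights[source] * classify(records[source])
        total += weights[source]
    if total == 0.0:
        raise ValueError("no data source available for this sample")
    # Predict a site when at least half of the available weight votes 1.
    return 1 if score / total >= 0.5 else 0


# Toy base classifiers (illustrative rules only, not real predictors).
toy_classifiers = {
    # Votes 1 when the central residue of the window is S, T, or Y.
    "sequence": lambda window: 1 if window[len(window) // 2] in "STY" else 0,
    # Votes 1 when a made-up accessibility score exceeds 0.3.
    "structure": lambda accessibility: 1 if accessibility > 0.3 else 0,
}
toy_weights = {"sequence": 0.6, "structure": 0.4}

sample = {"sequence": "AASPK", "structure": 0.1}
prediction = fuse_predictions(toy_classifiers, sample, toy_weights)
```

In this sketch a sample that is missing from one database is still classified from the remaining sources, which is one simple way to handle data that stays distributed across heterogeneous stores.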


Keywords: Ensemble classification, Phosphorylation, Base classifier, Fusion





Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Genetha A. Gray (1)
  • Pamela J. Williams (1)
  • W. Michael Brown (2)
  • Jean-Loup Faulon (3)
  • Kenneth L. Sale (4)

  1. Computational Sciences & Mathematics Research Department, Sandia National Laboratories, Livermore, USA
  2. Computational Biology Department, Sandia National Laboratories, Albuquerque, USA
  3. Computational Bioscience Department, Sandia National Laboratories, Albuquerque, USA
  4. Biosystems Research Department, Sandia National Laboratories, Livermore, USA
