Predicting Protein Localization Using a Domain Adaptation Approach

  • Nic Herndon
  • Doina Caragea
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 452)


A challenge arising from the ever-increasing volume of biological data generated by next-generation sequencing technologies is the annotation of this data, e.g., identifying gene structure from the location of splice sites, or predicting protein function or localization. Annotation can be automated with classification algorithms, but supervised classification requires large amounts of labeled data for the problem at hand. For many problems, labeled data is not available, although it might be available for a similar, related problem. To leverage the labeled data available for the related problem, we propose an algorithm that builds a naïve Bayes classifier for biological sequences in a domain adaptation setting. Specifically, it uses the existing large corpus of labeled data from a source organism, in conjunction with any available labeled data and a large amount of unlabeled data from a target organism, thus alleviating the need to manually label a large number of sequences for a supervised classifier. When tested on the task of predicting protein localization from the composition of the protein, this algorithm performed better than the multinomial naïve Bayes classifier. However, on the more difficult task of splice site prediction, the results were not satisfactory.
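
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a multinomial naïve Bayes model over amino-acid composition, a hypothetical weight `delta` that balances source counts (weight `1 - delta`) against target counts (weight `delta`), and a simple self-training loop that pseudo-labels confident unlabeled target sequences. The function names, `delta`, `threshold`, and `rounds` are all illustrative assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Count amino-acid occurrences in a protein sequence."""
    return Counter(ch for ch in seq if ch in AMINO_ACIDS)


def train(weighted_examples, classes):
    """Fit multinomial naive Bayes from (sequence, label, weight) triples."""
    # Start every count at 1.0 (Laplace smoothing) so no probability is zero.
    class_mass = {c: 1.0 for c in classes}
    feat_mass = {c: {a: 1.0 for a in AMINO_ACIDS} for c in classes}
    for seq, label, weight in weighted_examples:
        class_mass[label] += weight
        for aa, n in composition(seq).items():
            feat_mass[label][aa] += weight * n
    total = sum(class_mass.values())
    log_prior = {c: math.log(class_mass[c] / total) for c in classes}
    log_like = {}
    for c in classes:
        z = sum(feat_mass[c].values())
        log_like[c] = {a: math.log(feat_mass[c][a] / z) for a in AMINO_ACIDS}
    return log_prior, log_like


def posterior(model, seq):
    """Return the normalized class posterior for one sequence."""
    log_prior, log_like = model
    counts = composition(seq)
    scores = {
        c: log_prior[c] + sum(n * log_like[c][a] for a, n in counts.items())
        for c in log_prior
    }
    top = max(scores.values())  # subtract the max for numerical stability
    unnorm = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


def adapt(source, target_labeled, target_unlabeled, classes,
          delta=0.7, threshold=0.9, rounds=5):
    """Self-training with source weighted 1 - delta and target weighted delta.

    delta, threshold, and rounds are assumed hyperparameters for this sketch.
    """
    pseudo = {}  # index into target_unlabeled -> pseudo-label
    model = None
    for _ in range(rounds):
        examples = [(s, y, 1.0 - delta) for s, y in source]
        examples += [(s, y, delta) for s, y in target_labeled]
        examples += [(target_unlabeled[i], y, delta) for i, y in pseudo.items()]
        model = train(examples, classes)
        # Keep only predictions the current model is confident about.
        new_pseudo = {}
        for i, seq in enumerate(target_unlabeled):
            post = posterior(model, seq)
            label = max(post, key=post.get)
            if post[label] >= threshold:
                new_pseudo[i] = label
        if new_pseudo == pseudo:  # pseudo-labels stopped changing
            break
        pseudo = new_pseudo
    return model, pseudo
```

Calling `adapt(source, target_labeled, target_unlabeled, classes)` with lists of `(sequence, label)` pairs for the two labeled sets and a list of raw sequences for the unlabeled set returns the fitted model and the pseudo-labels from the last pass. Hard pseudo-labels with a confidence threshold are one common self-training choice; soft (probability-weighted) labels would be an equally plausible variant.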


Keywords: Naïve Bayes; Domain adaptation; Supervised learning; Semi-supervised learning; Self-training; Biological sequences; Protein localization



The computing for this project was performed on the Beocat Research Cluster at Kansas State University, which is funded in part by NSF grants CNS-1006860, EPS-1006860, EPS-0919443, and MRI-1126709.



Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  1. Kansas State University, Manhattan, USA
