Predicting subcellular localization of multi-location proteins by improving support vector machines with an adaptive-decision scheme

  • Shibiao Wan
  • Man-Wai Mak
Original Article


From a machine-learning perspective, predicting the subcellular localization of multi-location proteins is a multi-label classification problem. Conventional multi-label classifiers typically compare pattern-matching scores against a fixed decision threshold to determine the number of subcellular locations in which a protein resides. This simple strategy, however, easily leads to over-prediction because it produces a large number of false positives. To address this problem, this paper proposes a more powerful multi-label predictor, AD–SVM, which incorporates an adaptive-decision (AD) scheme into multi-label support vector machine (SVM) classifiers. Specifically, given a query protein, a term-frequency-based gene ontology (GO) vector is constructed by successively searching the Gene Ontology Annotation database. The feature vector is then classified by AD–SVM, which extends the binary relevance method with an adaptive-decision scheme that essentially converts the linear SVMs into piecewise-linear SVMs. Experimental results suggest that AD–SVM outperforms existing state-of-the-art multi-location predictors by at least 4% (absolute) on a stringent virus dataset and by at least 1% (absolute) on a stringent plant dataset. The results also show that the adaptive-decision scheme effectively reduces over-prediction while having an insignificant effect on proteins whose locations are already correctly predicted.
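
The two steps described above, building a term-frequency GO vector and replacing a fixed threshold with an adaptive one, can be illustrated with a minimal Python sketch. The abstract does not state the exact decision rule, so the rule below is an assumption: the cutoff is taken as a fraction `theta` of the highest SVM score, capped at 1.0, and the top-scoring location is always returned when no score is positive. The function names, the `theta` parameter, and the example scores are hypothetical and only serve the illustration; the per-location scores would in practice come from one linear SVM per location (binary relevance).

```python
from collections import Counter


def go_term_frequency_vector(go_terms, vocabulary):
    """Count how often each GO term of `vocabulary` occurs in `go_terms`.

    `go_terms` is the list of GO terms retrieved for one query protein
    (repeats allowed); `vocabulary` fixes the order of the vector entries.
    """
    counts = Counter(go_terms)
    return [counts[term] for term in vocabulary]


def fixed_decision(scores, threshold=0.0):
    """Baseline: keep every location whose score exceeds a fixed threshold.

    Falls back to the single top-scoring location if nothing passes.
    """
    chosen = [i for i, s in enumerate(scores) if s > threshold]
    top = max(range(len(scores)), key=scores.__getitem__)
    return chosen or [top]


def adaptive_decision(scores, theta=0.5):
    """Assumed adaptive rule (not taken from the abstract).

    The cutoff depends on the query's own top score:
    cutoff = min(1.0, theta * s_max) when s_max > 0.
    """
    top = max(range(len(scores)), key=scores.__getitem__)
    s_max = scores[top]
    if s_max <= 0:
        # No SVM is confident: report only the best-scoring location.
        return [top]
    cutoff = min(1.0, theta * s_max)
    return [i for i, s in enumerate(scores) if s >= cutoff]


# Illustrative GO identifiers and made-up SVM scores for four locations.
vocabulary = ["GO:0005634", "GO:0005737", "GO:0005886"]
retrieved = ["GO:0005634", "GO:0005737", "GO:0005634"]
feature_vector = go_term_frequency_vector(retrieved, vocabulary)

scores = [2.0, 0.3, -0.5, 1.2]
print(feature_vector)
print(fixed_decision(scores))
print(adaptive_decision(scores, theta=0.5))
```

With these made-up scores, the fixed threshold also accepts the weakly positive second location, whereas the adaptive cutoff scales with the top score and rejects it. Because the cutoff depends on which classifier scores highest, the resulting decision regions are piecewise linear even though each underlying SVM is linear.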


Adaptive decisions; Multi-label classification; Protein subcellular localization; Support vector machines



This work was in part supported by the RGC of Hong Kong SAR (Grant No. PolyU 152117/14E).



Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, China
