AUC-Maximized Deep Convolutional Neural Fields for Protein Sequence Labeling

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9852)

Abstract

Deep Convolutional Neural Networks (DCNN) has shown excellent performance in a variety of machine learning tasks. This paper presents Deep Convolutional Neural Fields (DeepCNF), an integration of DCNN with Conditional Random Field (CRF), for sequence labeling with an imbalanced label distribution. The widely-used training methods, such as maximum-likelihood and maximum labelwise accuracy, do not work well on imbalanced data. To handle this, we present a new training algorithm called maximum-AUC for DeepCNF. That is, we train DeepCNF by directly maximizing the empirical Area Under the ROC Curve (AUC), which is an unbiased measurement for imbalanced data. To fulfill this, we formulate AUC in a pairwise ranking framework, approximate it by a polynomial function and then apply a gradient-based procedure to optimize it. Our experimental results confirm that maximum-AUC greatly outperforms the other two training methods on 8-state secondary structure prediction and disorder prediction since their label distributions are highly imbalanced and also has similar performance as the other two training methods on solvent accessibility prediction, which has three equally-distributed labels. Furthermore, our experimental results show that our AUC-trained DeepCNF models greatly outperform existing popular predictors of these three tasks. The data and software related to this paper are available at https://github.com/realbigws/DeepCNF_AUC.

Supplementary material

431508_1_En_1_MOESM1_ESM.pdf (647 kb)
Supplementary material 1 (pdf 647 KB)

References

  1. 1.
    Andreeva, A., Howorth, D., Chothia, C., Kulesha, E., Murzin, A.G.: Scop2 prototype: a new approach to protein structure mining. Nucleic Acids Res. 42(D1), D310–D314 (2014)Google Scholar
  2. 2.
    Blom, N., Sicheritz-Pontén, T., Gupta, R., Gammeltoft, S., Brunak, S.: Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics 4(6), 1633–1649 (2004)CrossRefGoogle Scholar
  3. 3.
    Calders, T., Jaroszewicz, S.: Efficient AUC optimization for classification. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) PKDD 2007. LNCS(LNAI), vol. 4702, pp. 42–53. Springer, Heidelberg (2007). doi:10.1007/978-3-540-74976-9_8 CrossRefGoogle Scholar
  4. 4.
    Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)MATHGoogle Scholar
  5. 5.
    Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., Semantic image segmentation with deep convolutional nets, fully connected CRFs. arXiv preprint arXiv: 1412.7062 (2014)
  6. 6.
    Chothia, C.: The nature of the accessible and buried surfaces in proteins. J. Mol. Biol. 105(1), 1–12 (1976)CrossRefGoogle Scholar
  7. 7.
    Cortes, C., Mohri, M.: Auc optimization vs. error rate minimization. Adv. Neural Inf. Process. Syst. 16(16), 313–320 (2004)Google Scholar
  8. 8.
    De Lannoy, G., François, D., Delbeke, J., Verleysen, M.: Weighted conditional random fields for supervised interpatient heartbeat classification. IEEE Trans. Biomed. Eng. 59(1), 241–247 (2012)CrossRefGoogle Scholar
  9. 9.
    Di Lena, P., Nagata, K., Baldi, P.: Deep architectures for protein contact map prediction. Bioinformatics 28(19), 2449–2457 (2012)CrossRefGoogle Scholar
  10. 10.
    Dill, K.A.: Dominant forces in protein folding. Biochemistry 29(31), 7133–7155 (1990)CrossRefGoogle Scholar
  11. 11.
    Drozdetskiy, A., Cole, C., Procter, J., Barton, G.J.: Jpred4: a protein secondary structure prediction server. Nucleic Acids Res., gkv332 (2015)Google Scholar
  12. 12.
    Eickholt, J., Cheng, J.: Dndisorder: predicting protein disorder using boosting and deep networks. BMC Bioinf. 14(1), 88 (2013)CrossRefGoogle Scholar
  13. 13.
    Estabrooks, A., Jo, T., Japkowicz, N.: A multiple resampling method for learning from imbalanced data sets. Comput. Intell. 20(1), 18–36 (2004)MathSciNetCrossRefGoogle Scholar
  14. 14.
    Faraggi, E., Xue, B., Zhou, Y.: Improving the prediction accuracy of residue solvent accessibility and real-value backbone torsion angles of proteins by guided-learning through a two-layer neural network. Proteins: Struct., Funct., Bioinf. 74(4), 847–856 (2009)CrossRefGoogle Scholar
  15. 15.
    Ferri, C., Flach, P., Hernández-Orallo, J.: Learning decision trees using the area under the ROC curve. In: ICML, vol. 2, pp. 139–146 (2002)Google Scholar
  16. 16.
    Gross, S.S., Russakovsky, O., Do, C.B., Batzoglou, S.: Training conditional random fields for maximum labelwise accuracy. In: Advances in Neural Information Processing Systems, pp. 529–536 (2006)Google Scholar
  17. 17.
    Haas, J., Roth, S., Arnold, K., Kiefer, F., Schmidt, T., Bordoli, L., Schwede, T.: The protein model portala comprehensive resource for protein structure and model information. Database 2013, bat031 (2013)Google Scholar
  18. 18.
    Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1), 29–36 (1982)CrossRefGoogle Scholar
  19. 19.
    He, B., Wang, K., Liu, Y., Xue, B., Uversky, V.N., Dunker, A.K.: Predicting intrinsic disorder in proteins: an overview. Cell Res. 19(8), 929–949 (2009)CrossRefGoogle Scholar
  20. 20.
    He, H., Garcia, E., et al.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)CrossRefGoogle Scholar
  21. 21.
    Herschtal, A., Raskutti, B.: Optimising area under the ROC curve using gradient descent. In: Proceedings of the Twenty-first International Conference on Machine Learning, p. 49. ACM (2004)Google Scholar
  22. 22.
    Hinton, G., Deng, L., Dong, Y., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., et al.: Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Sig. Process. Mag. 29(6), 82–97 (2012)CrossRefGoogle Scholar
  23. 23.
    Joachims, T.: A support vector method for multivariate performance measures. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 377–384. ACM (2005)Google Scholar
  24. 24.
    Jones, D.T., Cozzetto, D.: Disopred3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics 31(6), 857–863 (2015)CrossRefGoogle Scholar
  25. 25.
    Joo, K., Joung, I., Lee, S.Y., Kim, J.Y., Cheng, Q., Manavalan, B., Joung, J.Y., Heo, S., Lee, J., Nam, M., et al.: Template based protein structure modeling by global optimization in CASP11. Proteins: Struct., Funct., Bioinform. (2015)Google Scholar
  26. 26.
    Kabsch, W., Sander, C.: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12), 2577–2637 (1983)CrossRefGoogle Scholar
  27. 27.
    Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)Google Scholar
  28. 28.
    Kryshtafovych, A., Barbato, A., Fidelis, K., Monastyrskyy, B., Schwede, T., Tramontano, A.: Assessment of the assessment: evaluation of the model quality estimates in CASP10. Proteins: Struct., Funct., Bioinform. 82(S2), 112–126 (2014)CrossRefGoogle Scholar
  29. 29.
    Lafferty, J., McCallum, A., Pereira, F.C.: Conditional random fields: probabilistic models for segmenting and labeling sequence data (2001)Google Scholar
  30. 30.
    LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)CrossRefGoogle Scholar
  31. 31.
    Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45(1–3), 503–528 (1989)MathSciNetCrossRefMATHGoogle Scholar
  32. 32.
    Liu, X.-Y., Jianxin, W., Zhou, Z.-H.: Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man, Cybern. Part B (Cybern.) 39(2), 539–550 (2009)CrossRefGoogle Scholar
  33. 33.
    Ma, J., Wang, S.: Acconpred: Predicting solvent accessibility and contact number simultaneously by a multitask learning framework under the conditional neural fields model. BioMed Res. Int. 2015 (2015)Google Scholar
  34. 34.
    Magnan, C.N., Baldi, P.: SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics 30(18), 2592–2597 (2014)CrossRefGoogle Scholar
  35. 35.
    Monastyrskyy, B., Fidelis, K., Moult, J., Tramontano, A., Kryshtafovych, A.: Evaluation of disorder predictions in CASP9. Proteins: Struct., Funct., Bioinform. 79(S10), 107–118 (2011)CrossRefGoogle Scholar
  36. 36.
    Narasimhan, H., Agarwal, S.: A structural SVM based approach for optimizing partial AUC. In: Proceedings of the 30th International Conference on Machine Learning, pp. 516–524 (2013)Google Scholar
  37. 37.
    Oldfield, C.J., Dunker, A.K.: Intrinsically disordered proteins and intrinsically disordered protein regions. Ann. Rev. Biochem. 83, 553–584 (2014)CrossRefGoogle Scholar
  38. 38.
    Pauling, L., Corey, R.B., Branson, H.R.: The structure of proteins: two hydrogen-bonded helical configurations of the polypeptide chain. Proc. Nat. Acad. Sci. 37(4), 205–211 (1951)CrossRefGoogle Scholar
  39. 39.
    Peng, J., Bo, L., Xu, J.: Conditional neural fields. In: Advances in Neural Information Processing Systems, pp. 1419–1427 (2009)Google Scholar
  40. 40.
    Rosenfeld, N., Meshi, O., Globerson, A., Tarlow, D.: Learning structured models with the AUC loss and its generalizations. In: Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pp. 841–849 (2014)Google Scholar
  41. 41.
    Schlessinger, A., Punta, M., Rost, B.: Natively unstructured regions in proteins identified from contact predictions. Bioinformatics 23(18), 2376–2384 (2007)CrossRefGoogle Scholar
  42. 42.
    Sillitoe, I., Lewis, T.E., Cuff, A., Das, S., Ashford, P., Dawson, N.L., Furnham, N., Laskowski, R.A., Lee, D., Lees, J.G., et al.: Cath: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res. 43(D1), D376–D381 (2015)CrossRefGoogle Scholar
  43. 43.
    Wang, S., Li, W., Liu, S., Jinbo, X.: Raptorx-property: a web server for protein structure property prediction. Nucleic Acids Res., gkw306 (2016)Google Scholar
  44. 44.
    Wang, S., Peng, J., Ma, J., Jinbo, X.: Protein secondary structure prediction using deep convolutional neural fields. Sci. Rep. 6 (2016)Google Scholar
  45. 45.
    Wang, S., Weng, S., Ma, J., Tang, Q.: Deepcnf-d: predicting protein order/disorder regions by weighted deep convolutional neural fields. Int. J. Mol. Sci. 16(8), 17315–17330 (2015)CrossRefGoogle Scholar
  46. 46.
    Wang, Z., Zhao, F., Peng, J., Jinbo, X.: Protein 8-class secondary structure prediction using conditional neural fields. Proteomics 11(19), 3786–3792 (2011)CrossRefGoogle Scholar
  47. 47.
    Jinbo, X., Wang, S., Ma, J.: Protein Homology Detection Through Alignment of Markov Random Fields: Using MRFalign. SpringerBriefs in Computer Science. Springer, Heidelberg (2015)Google Scholar
  48. 48.
    Zhou, J., Troyanskaya, O.G.: Deep supervised, convolutional generative stochastic network for protein secondary structure prediction. arXiv preprint arXiv:1403.1347 (2014)

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. 1.Toyota Technological Institute at ChicagoChicagoUSA
  2. 2.University of ChicagoChicagoUSA

Personalised recommendations