Protein Secondary Structure Prediction Using Machine Learning

  • Sriparna Saha
  • Asif Ekbal
  • Sidharth Sharma
  • Sanghamitra Bandyopadhyay
  • Ujjwal Maulik
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 182)

Abstract

Protein structure prediction is an important step in understanding protein structure and function. Accurate prediction of protein secondary structure helps in understanding protein folding. Many applications, such as drug discovery, require predicting the secondary structure of unknown proteins. In this paper we report our first attempt at secondary structure prediction and approach it as a sequence classification problem: the task is to assign a sequence of labels (helix, sheet, or coil) to a given protein sequence. We propose an ensemble technique based on two stochastic supervised machine learning algorithms, namely the Maximum Entropy Markov Model (MEMM) and the Conditional Random Field (CRF). We identify and implement a set of features that mostly capture contextual information. The proposed approach is evaluated on a benchmark dataset, and its performance is encouraging enough to warrant further exploration. The highest predictive accuracy obtained is 61.26%, with a segment overlap (SOV) score of 52.30%.
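The abstract does not give the feature templates or the combination rule, so the following Python is only a minimal sketch of the general idea under stated assumptions: contextual features are taken from a fixed window around each residue, the outputs of two sequence labellers (standing in for the MEMM and the CRF) are merged by a weighted vote, and accuracy is the fraction of correctly labelled residues. The function names, the window size, the label symbols H/E/C, and the weights are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Assumed label alphabet: H = helix, E = sheet (strand), C = coil.
LABELS = ("H", "E", "C")


def context_features(sequence, i, window=2):
    """Return features for residue i: the residue and its neighbours in a window."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        # Positions outside the sequence are padded with a boundary symbol.
        feats[f"res[{offset:+d}]"] = sequence[j] if 0 <= j < len(sequence) else "-"
    return feats


def weighted_vote(predictions, weights):
    """Combine per-residue label strings from several models by weighted voting."""
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    combined = []
    for pos in range(length):
        scores = defaultdict(float)
        for pred, weight in zip(predictions, weights):
            scores[pred[pos]] += weight
        # Ties are broken by the fixed order of LABELS.
        combined.append(max(LABELS, key=lambda lab: scores[lab]))
    return "".join(combined)


def per_residue_accuracy(predicted, observed):
    """Fraction of residues whose predicted label matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


if __name__ == "__main__":
    memm_out = "HHEC"  # hypothetical output of the first model
    crf_out = "HCEE"   # hypothetical output of the second model
    ensemble = weighted_vote([memm_out, crf_out], weights=[0.4, 0.6])
    print(context_features("ACDE", 0, window=1))
    print(ensemble, per_residue_accuracy(ensemble, "HCEC"))
```

In practice the weights would presumably be derived from each model's performance on held-out data; the fixed values here are placeholders. The segment overlap (SOV) score is a separate, segment-based measure and is not computed by this sketch.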

Keywords

Benchmark Dataset; Conditional Random Field; Protein Secondary Structure; Protein Structure Prediction; Weighted Vote

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Sriparna Saha (1)
  • Asif Ekbal (1)
  • Sidharth Sharma (1)
  • Sanghamitra Bandyopadhyay (2)
  • Ujjwal Maulik (3)
  1. Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, India
  2. Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
  3. Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
