Abstract
As the cost of biological sequencing falls, new sequences are flowing into biological databases such as NCBI, SwissProt, and UniProt at an ever-increasing rate. Annotating these newly sequenced proteins can aid groundbreaking discoveries in the development of novel drugs and potential therapies for diseases. Previous work in this field has harnessed the high computational power of modern machines to achieve good prediction quality, but at the cost of high-dimensional feature sets. To address this trade-off, we propose a novel word segmentation-based feature selection strategy that classifies protein sequences using a highly condensed feature set. An incremental classifier selection strategy was seen to yield better results than all existing methods. To provide a level ground for evaluation and comparison, we used the antioxidant protein data curated in previous work. On this data, the proposed method outperformed all existing works, with an accuracy of 95%.
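To make the idea of a condensed, word-based feature set concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "words" are overlapping fixed-length k-mers and that the feature set is condensed by keeping only the most frequent words; the paper's actual segmentation and selection rules may differ. The function names (`kmer_words`, `build_vocabulary`, `featurize`), the parameters `k` and `top_n`, and the toy sequences are all hypothetical.

```python
from collections import Counter


def kmer_words(sequence, k=2):
    """Split a protein sequence into overlapping k-letter 'words' (assumed segmentation)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocabulary(sequences, k=2, top_n=5):
    """Keep only the top_n most frequent words across all sequences (assumed selection rule)."""
    counts = Counter()
    for seq in sequences:
        counts.update(kmer_words(seq, k))
    # Sort by descending count, then alphabetically, so ties are resolved deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


def featurize(sequence, vocabulary, k=2):
    """Map a sequence to a vector of word frequencies over the condensed vocabulary."""
    words = kmer_words(sequence, k)
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero for very short sequences
    return [counts[w] / total for w in vocabulary]


# Toy sequences, invented for illustration only.
toy_sequences = ["MKVLAAGIV", "MKVLSAGLV", "GGAAGGAAG"]
vocabulary = build_vocabulary(toy_sequences, k=2, top_n=5)
vectors = [featurize(s, vocabulary) for s in toy_sequences]
```

Each sequence is reduced to a vector whose length equals the vocabulary size, regardless of how many distinct words the corpus contains. Such vectors could then be passed to any standard classifier; the choice of classifiers is outside the scope of this sketch.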
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Sriram, A., Sanapala, M., Patel, R., Patil, N. (2018). Classification of Protein Sequences by Means of an Ensemble Classifier with an Improved Feature Selection Strategy. In: Sa, P., Bakshi, S., Hatzilygeroudis, I., Sahoo, M. (eds) Recent Findings in Intelligent Computing Techniques. Advances in Intelligent Systems and Computing, vol 708. Springer, Singapore. https://doi.org/10.1007/978-981-10-8636-6_18
DOI: https://doi.org/10.1007/978-981-10-8636-6_18
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8635-9
Online ISBN: 978-981-10-8636-6
eBook Packages: Engineering, Engineering (R0)