A parallel classification framework for protein fold recognition

  • Elham Hekmatnia
  • Hedieh Sajedi
  • Ali Habib Agahi
Research Paper


A protein's tertiary structure, which is determined by its amino acid sequence through the protein folding process, plays an essential role in the protein's function. Protein fold recognition is one of the active research problems in bioinformatics. To address it, this paper proposes a Feature Selection (FS) method based on the Map_Reduce framework and the Vortex Search Algorithm (VSA). FS is one of the most important data pre-processing steps; it aims to select a subset of relevant features. For non-parallel settings and conventional data, hundreds of feature selection and dimensionality reduction algorithms have been proposed, such as Principal Component Analysis and Linear Discriminant Analysis. Nevertheless, these algorithms were not designed for real-world applications in which data grow along three dimensions (volume, velocity, and variety), a setting known as Big Data; applying earlier feature selection methods to Big Data requires a large volume of complex computation. VSA is inspired by the vortex pattern created by the vortical flow of stirred fluids. In the Map_Reduce framework, the Map and Reduce functions are executed in parallel. In the proposed method, each step of the Map function employs a VSA to find an optimized subset of features and to reduce the feature search space. We evaluate the proposed method on the classification of a benchmark dataset for protein fold recognition. The experimental results indicate that the proposed method improves prediction accuracy considerably.
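The combination described in the abstract, a vortex-style search run inside each Map step and a Reduce step that merges the per-partition results, can be illustrated with a small sketch. This is not the authors' implementation. It assumes a leave-one-out 1-nearest-neighbour accuracy as the fitness function, a geometric radius decay in place of the radius schedule of the published VSA, clipping of candidates to the search bounds, and a majority vote as the Reduce step. All function names, parameters, and the toy data are hypothetical, and a thread pool stands in for a real Map_Reduce cluster.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def loo_1nn_accuracy(rows, labels, mask):
    """Leave-one-out 1-NN accuracy using only the features kept by mask."""
    idx = [j for j, keep in enumerate(mask) if keep]
    if not idx:
        return 0.0  # an empty feature subset cannot classify anything
    correct = 0
    for i, x in enumerate(rows):
        best_d, best_lab = float("inf"), None
        for k, y in enumerate(rows):
            if k == i:
                continue
            d = sum((x[j] - y[j]) ** 2 for j in idx)
            if d < best_d:
                best_d, best_lab = d, labels[k]
        correct += best_lab == labels[i]
    return correct / len(rows)


def vortex_search_mask(rows, labels, n_iter=30, n_candidates=10, seed=0):
    """Simplified vortex-style search over [0, 1]^d, thresholded to a mask."""
    rng = random.Random(seed)
    dim = len(rows[0])
    center, radius = [0.5] * dim, 0.5  # start at the middle of the bounds
    best_pos, best_mask, best_fit = center, [False] * dim, 0.0
    for _ in range(n_iter):
        for _ in range(n_candidates):
            # Sample around the current center; clip to bounds (assumption).
            cand = [min(1.0, max(0.0, rng.gauss(c, radius))) for c in center]
            mask = [v > 0.5 for v in cand]
            fit = loo_1nn_accuracy(rows, labels, mask)
            if fit > best_fit:
                best_pos, best_mask, best_fit = cand, mask, fit
        center = best_pos  # move the center to the best solution so far
        radius *= 0.9      # assumed geometric decay, not the published schedule
    return best_mask, best_fit


def map_select(chunk):
    """Map step: run the search on one data partition and emit its mask."""
    rows, labels, seed = chunk
    return vortex_search_mask(rows, labels, seed=seed)[0]


def reduce_masks(masks):
    """Reduce step (assumed): keep features chosen by at least half the mappers."""
    votes = Counter(j for m in masks for j, keep in enumerate(m) if keep)
    return sorted(j for j, v in votes.items() if 2 * v >= len(masks))


def select_features(rows, labels, n_chunks=3):
    """Split the data, run the mappers concurrently, then merge their masks."""
    chunks = [(rows[k::n_chunks], labels[k::n_chunks], k) for k in range(n_chunks)]
    with ThreadPoolExecutor() as pool:
        masks = list(pool.map(map_select, chunks))
    return reduce_masks(masks)


def make_toy_data(n=40, seed=1):
    """Hypothetical data: features 0 and 1 are informative, 2 to 4 are noise."""
    rng = random.Random(seed)
    rows, labels = [], []
    for i in range(n):
        lab = i % 2
        rows.append([lab + rng.gauss(0, 0.1), lab + rng.gauss(0, 0.1)]
                    + [rng.gauss(0, 1) for _ in range(3)])
        labels.append(lab)
    return rows, labels


rows, labels = make_toy_data()
print("selected feature indices:", select_features(rows, labels))
```

Because each mapper sees only its own partition, the per-partition searches are independent of one another, which is what makes the Map step straightforward to distribute; the voting threshold in the Reduce step is one possible merge rule among many.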


Keywords: Protein fold recognition · Feature selection · Map_Reduce · Parallel computing · Distributed computing · Vortex search algorithm




Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Authors and Affiliations

  • Elham Hekmatnia (1)
  • Hedieh Sajedi (2), Email author
  • Ali Habib Agahi (3)
  1. Department of Computer Engineering, School of Engineering, Azad University, Science and Research Branch, Tehran, Iran
  2. Department of Computer Science, School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, Iran
  3. Department of Computer Engineering, School of Engineering, Azad University, South Tehran Branch, Tehran, Iran
