Scalable Prediction of Intrinsically Disordered Protein Regions with Spark Clusters on Microsoft Azure Cloud

  • Dariusz Mrozek
Part of the Computational Biology book series (COBO, volume 28)


Intrinsically disordered proteins (IDPs) constitute a wide range of molecules that act in cells of living organisms and mediate many protein–protein interactions and many regulatory processes. Computational identification of disordered regions in protein amino acid sequences, thus, became an important branch of 3D protein structure prediction and modeling. In this chapter, we will see the IDP meta-predictor that applies an ensemble of primary predictors in order to increase the quality of IDP prediction. We will also see the highly scalable implementation of the meta-predictor on the Spark cluster (Spark-IDPP) that mitigates the problem of the exponentially growing number of protein amino acid sequences in public repositories. Spark-IDPP responds very well to the current needs of IDP prediction by parallelizing computations on the Spark cluster that can be scaled on demand on the Microsoft Azure cloud according to particular requirements for computing power.


Proteins 3D protein structure Tertiary structure Intrinsically disordered proteins Cloud computing Parallel computing Spark Microsoft Azure Public cloud 


  1. 1.
    Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402 (1997). Scholar
  2. 2.
    Bai, C., Dhavale, D., Sarkis, J.: Complex investment decisions using rough set and fuzzy c-means: an example of investment in green supply chains. Eur. J. Oper. Res. 248(2), 507–521 (2016)MathSciNetzbMATHCrossRefGoogle Scholar
  3. 3.
    Baron, T.: Prediction of intrinsically disordered proteins in Apache Spark. Master’s thesis, Institute of Informatics, Silesian University of Technology, Gliwice, Poland (2016)Google Scholar
  4. 4.
    Bayer, P., Arndt, A., Metzger, S., Mahajan, R., Melchior, F., Jaenicke, R., Becker, J.: Structure determination of the small ubiquitin-related modifier SUMO-1. J. Mol. Biol. 280(2), 275–286 (1998). Scholar
  5. 5.
    Benson, D.A., Cavanaugh, M., Clark, K., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Sayers, E.W.: GenBank. Nucleic Acids Res. 45(D1), D37–D42 (2017). Scholar
  6. 6.
    Berman, H., et al.: The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000)CrossRefGoogle Scholar
  7. 7.
    Boutet, E., Lieberherr, D., Tognolli, M., Schneider, M., Bansal, P., Bridge, A.J., Poux, S., Bougueleret, L., Xenarios, I.: UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View, pp. 23–54. Springer, New York (2016)CrossRefGoogle Scholar
  8. 8.
    Ceri, S., Kaitoua, A., Masseroli, M., Pinoli, P., Venco, F.: Data management for heterogeneous genomic datasets. IEEE/ACM Trans. Comput. Biol. Bioinform. 99, 1–1 (2016)Google Scholar
  9. 9.
    Chang, H., Mishra, N., Lin, C.: IoT Big-Data centred knowledge granule analytic and cluster framework for BI applications: a case base analysis. Plos One 10, 1–23 (2015)Google Scholar
  10. 10.
    Cheng, J., Sweredoski, M.J., Baldi, P.: Accurate prediction of protein disordered regions by mining protein structure data. Data Min. Knowl. Discov. 11(3), 213–222 (2005), Scholar
  11. 11.
    Cupek, R., Ziebinski, A., Huczala, L., Erdogan, H.: Agent-based manufacturing execution systems for short-series production scheduling. Comput. Ind. 82, 245–258 (2016)CrossRefGoogle Scholar
  12. 12.
    Czerniak, J.M., Dobrosielski, W.T., Apiecionek, Ł., Ewald, D.: Representation of a trend in OFN during fuzzy observance of the water level from the Crisis control center. In: Proceedings of the 2015 Federated Conference on Computer Science and Information Systems (FedCSIS), pp. 443–447 (2015)Google Scholar
  13. 13.
    Davis, G.B., Carley, K.M.: Clearing the fog: fuzzy, overlapping groups for social networks. Soc. Netw. 30(3), 201–212 (2008)CrossRefGoogle Scholar
  14. 14.
    De Maio, C., Fenza, G., Loia, V., Parente, M.: Time aware knowledge extraction for microblog summarization on Twitter. Inf. Fus. 28, 60–74 (2016)CrossRefGoogle Scholar
  15. 15.
    Dosztányi, Z., Csizmok, V., Tompa, P., Simon, I.: IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21(16), 3433–3434 (2005). Scholar
  16. 16.
    Dunker, A.K., Silman, I., Uversky, V.N., Sussman, J.L.: Function and structure of inherently disordered proteins. Curr. Opin. Struct. Biol. 18(6), 756–764 (2008)CrossRefGoogle Scholar
  17. 17.
    Feng, X., Grossman, R., Stein, L.: PeakRanger: a cloud-enabled peak caller for ChIP-seq data. BMC Bioinform.12(1), 1–11 (2011), Scholar
  18. 18.
    Guo, K., Zhang, R., Kuang, L.: TMR: towards an efficient semantic-based heterogeneous transportation media Big Data retrieval. Neurocomputing 181, 122–131 (2016)CrossRefGoogle Scholar
  19. 19.
    Hazelhurst, S.: PH2: an Hadoop-based framework for mining structural properties from the PDB database. In: Proceedings of the 2010 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists, pp. 104–112 (2010)Google Scholar
  20. 20.
    Hirose, S., Shimizu, K., Kanai, S., Kuroda, Y., Noguchi, T.: POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions. Bioinformatics 23(16), 2046–2053 (2007). Scholar
  21. 21.
    Hu, C., Ren, G., Liu, C., Li, M., Jie, W.: A Spark-based genetic algorithm for sensor placement in large scale drinking water distribution systems. Clust. Comput. 20(2), 1089–1099 (2017). Scholar
  22. 22.
    Hung, C.L., Hua, G.J.: Cloud Computing for protein-ligand binding site comparison. Biomed Res. Int. 170356 (2013)Google Scholar
  23. 23.
    Hung, C.L., Lin, C.Y.: Open reading frame phylogenetic analysis on the cloud. Int. J. Genomics 2013(614923), 1–9 (2013)Google Scholar
  24. 24.
    Hung, C.L., Lin, Y.L.: Implementation of a parallel protein structure alignment service on cloud. Int. J. Genomics 439681, 1–8 (2013)Google Scholar
  25. 25.
    Ishida, T., Kinoshita, K.: PrDOS: prediction of disordered protein regions from amino acid sequence. Nucleic Acids Res. 35(suppl\(\_\)2), W460–W464 (2007). Scholar
  26. 26.
    Jensen, K., Nguyen, H.T., Do, T.V., Årnes, A.: a big data analytics approach to combat telecommunication vulnerabilities. Clust. Comput. 20(3), 2363–2374 (2017). Scholar
  27. 27.
    Jin, Y., Dunbrack, R.: Assessment of disorder predictions in CASP6. Proteins 61, 167–175 (2005)CrossRefGoogle Scholar
  28. 28.
    Kabsch, W., Sander, C.: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12), 2577–2637 (1987)CrossRefGoogle Scholar
  29. 29.
    Kelley, D.R., Schatz, M.C., Salzberg, S.L.: Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 11(11), 1–13 (2010). Scholar
  30. 30.
    Kozlowski, L.P., Bujnicki, J.M.: MetaDisorder: a meta-server for the prediction of intrinsic disorder in proteins. BMC Bioinform. 13(1), 111 (2012). Scholar
  31. 31.
    Langmead, B., Hansen, K.D., Leek, J.T.: Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol. 11(8), 1–11 (2010). Scholar
  32. 32.
    Langmead, B., Schatz, M.C., Lin, J., Pop, M., Salzberg, S.L.: Searching for SNPs with Cloud computing. Genome Biol. 10(11), 1–10 (2009). Scholar
  33. 33.
    Lewis, S., Csordas, A., Killcoyne, S., Hermjakob, H., et al.: Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework. BMC Bioinform. 13, 324 (2012)CrossRefGoogle Scholar
  34. 34.
    Linding, R., Jensen, L.J., Diella, F., Bork, P., Gibson, T.J., Russell, R.B.: Protein disorder prediction: implications for structural proteomics. Structure 11(11), 1453–1459 (2003). Scholar
  35. 35.
    Linding, R., Russell, R.B., Neduva, V., Gibson, T.J.: GlobPlot: exploring protein sequences for globularity and disorder. Nucleic Acids Res. 31(13), 3701–3708 (2003). Scholar
  36. 36.
    Lipman, D., Pearson, W.: Rapid and sensitive protein similarity searches. Science 227(4693), 1435–1441 (1985)CrossRefGoogle Scholar
  37. 37.
    Lu, H., Sun, Z., Qu, W.: Big Data-driven based real-time traffic flow state identification and prediction. Discret. Dyn. Nat. Soc. 2015, 1–11 (2015)MathSciNetGoogle Scholar
  38. 38.
    Lu, H., Sun, Z., Qu, W., Wang, L.: Real-time corrected traffic correlation model for traffic flow forecasting. Math. Probl. Eng. 2015, 1–7 (2015)Google Scholar
  39. 39.
    Mahmud, S., Iqbal, R., Doctor, F.: Cloud enabled data analytics and visualization framework for health-shocks prediction. Future Gener. Comput. Syst. 65, 169–181 (2016). (special Issue on Big Data in the Cloud)CrossRefGoogle Scholar
  40. 40.
    Małysiak-Mrozek, B., Baron, T., Mrozek, D.: Spark-IDPP: High throughput and scalable prediction of intrinsically disordered protein regions with Spark clusters on the Cloud, J. Clus. Comp, 1–35 (in review)Google Scholar
  41. 41.
    Małysiak-Mrozek, B., Stabla, M., Mrozek, D.: Soft and declarative fishing of information in Big Data lake. IEEE Trans. Fuzzy Syst. 99, 1–1 (2018)Google Scholar
  42. 42.
    Małysiak-Mrozek, B., Zur, K., Mrozek, D.: In-memory management system for 3D protein macromolecular structures. Curr. Proteomics 15 (2018). Scholar
  43. 43.
    Matsunaga, A., Tsugawa, M., Fortes, J.: Cloudblast: combining MapReduce and virtualization on distributed resources for bioinformatics applications. In: Proceedings of the IEEE Fourth International Conference on eScience (ESCIENCE ’08), pp. 222–229 (2008)Google Scholar
  44. 44.
    Matthews, S.J., Williams, T.L.: MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees. BMC Bioinform. 11(1), 1–9 (2010). Scholar
  45. 45.
    Meng, L., Tan, A., Wunsch, D.: Adaptive scaling of cluster boundaries for large-scale social media data clustering. IEEE Trans. Neural Netw. Learn. 27(12), 2656–2669 (2015)CrossRefGoogle Scholar
  46. 46.
    Mrozek, D.: High-Performance Computational Solutions in Protein Bioinformatics. SpringerBriefs in Computer Science. Springer International Publishing, Cham (2014)CrossRefGoogle Scholar
  47. 47.
    Mrozek, D., Daniłowicz, P., Małysiak-Mrozek, B.: HDInsight4PSi: boosting performance of 3D protein structure similarity searching with HDInsight clusters in Microsoft Azure cloud. Inf. Sci. 349–350, 77–101 (2016)CrossRefGoogle Scholar
  48. 48.
    Mrozek, D., Gosk, P., Małysiak-Mrozek, B.: Scaling Ab Initio predictions of 3D protein structures in Microsoft Azure cloud. J Grid Comput. 13, 561–585 (2015)CrossRefGoogle Scholar
  49. 49.
    Mrozek, D., Kutyła, T., Małysiak-Mrozek, B.: Accelerating 3D protein structure similarity searching on Microsoft Azure Cloud with local replicas of macromolecular data. In: Wyrzykowski, R. (ed.) Parallel Processing and Applied Mathematics - PPAM 2015. Lecture Notes in Computer Science, vol. 9574, pp. 1–12. Springer, Heidelberg (2016)Google Scholar
  50. 50.
    Mrozek, D., Małysiak-Mrozek, B., Kłapciński, A.: Cloud4Psi: cloud computing for 3D protein structure similarity searching. Bioinformatics 30(19), 2822–2825 (2014)CrossRefGoogle Scholar
  51. 51.
    Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kozielski, S.: Life sciences data analysis. Inform. Sci. 384, 86–89 (2017)zbMATHCrossRefGoogle Scholar
  52. 52.
    Piovesan, D., Tabaro, F., Mičetić, I., Necci, M., Quaglia, F., Oldfield, C.J., Aspromonte, M.C., Davey, N.E., Davidović, R., Dosztányi, Z., Elofsson, A., Gasparini, A., Hatos, A., Kajava, A.V., Kalmar, L., Leonardi, E., Lazar, T., Macedo-Ribeiro, S., Macossay-Castillo, M., Meszaros, A., Minervini, G., Murvai, N., Pujols, J., Roche, D.B., Salladini, E., Schad, E., Schramm, A., Szabo, B., Tantos, A., Tonello, F., Tsirigos, K.D., Veljković, N., Ventura, S., Vranken, W., Warholm, P., Uversky, V.N., Dunker, A.K., Longhi, S., Tompa, P., Tosatto, S.C.: DisProt 7.0: a major update of the database of disordered proteins. Nucleic Acids Res. 45(D1), D219–D227 (2017). Scholar
  53. 53.
    Powers, D.: Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Int. J. Mach. Learn. Technol. 2, 37–63 (2011)CrossRefGoogle Scholar
  54. 54.
    Qiu, X., Ekanayake, J., Beason, S., Gunarathne, T., Fox, G., Barga, R., Gannon, D.: Cloud technologies for bioinformatics applications. In: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers, pp. 6:1–6:10. MTAGS ’09, ACM, New York, NY, USA (2009).
  55. 55.
    Radenski, A., Ehwerhemuepha, L.: Speeding-up codon analysis on the cloud with local MapReduce aggregation. Inf. Sci. 263, 175–185 (2014)CrossRefGoogle Scholar
  56. 56.
    Rose, A.S., Hildebrand, P.W.: NGL viewer: a web application for molecular visualization. Nucleic Acids Res. 43(W1), W576–W579 (2015). Scholar
  57. 57.
    Schatz, M.C.: CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 25(11), 1363–1369 (2009)CrossRefGoogle Scholar
  58. 58.
    Shimizu, K., Hirose, S., Noguchi, T.: POODLE-S: web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix. Bioinformatics 23(17), 2337–2338 (2007). Scholar
  59. 59.
    Sickmeier, M., Hamilton, J.A., LeGall, T., Vacic, V., Cortese, M.S., Tantos, A., Szabo, B., Tompa, P., Chen, J., Uversky, V.N., Obradovic, Z., Dunker, A.K.: DisProt: the database of disordered proteins. Nucleic Acids Res. 35\((\text{suppl}\_1)\), D786–D793 (2007). Scholar
  60. 60.
    Su, C.T., Chen, C.Y., Hsu, C.M.: iPDA: integrated protein disorder analyzer. Nucleic Acids Res. 35(suppl\({\_}\)2), W465–W472 (2007). Scholar
  61. 61.
    Teijeiro, D., Pardo, X.C., Penas, D.R., González, P., Banga, J.R., Doallo, R.: A cloud-based enhanced differential evolution algorithm for parameter estimation problems in computational systems biology. Clust. Comput. 20(3), 1937–1950 (2017). Scholar
  62. 62.
    The UniProt consortium: Uniprot: the universal protein knowledgebase. Nucleic Acids Res. 45(D1), D158–D169 (2017).
  63. 63.
    Tripathy, B.K., Mittal, D.: Hadoop based uncertain possibilistic kernelized c-means algorithms for image segmentation and a comparative analysis. Appl. Soft Comput. 46, 886–923 (2016)CrossRefGoogle Scholar
  64. 64.
    Vullo, A., Bortolami, O., Pollastri, G., Tosatto, S.C.E.: Spritz: a server for the prediction of intrinsically disordered regions in protein sequences using kernel machines. Nucleic Acids Res. 34\((\text{ suppl }\_2)\), W164–W168 (2006). Scholar
  65. 65.
    Wang, H., Li, J., Hou, Z., Fang, R., Mei, W., Huang, J.: Research on parallelized real-time map matching algorithm for massive GPS data. Clust. Comput. 20(2), 1123–1134 (2017). Scholar
  66. 66.
    Wang, C., Li, X., Zhou, X., Wang, A., Nedjah, N.: Soft computing in Big Data intelligent transportation systems. Appl. Soft Comput. 38, 1099–1108 (2016)CrossRefGoogle Scholar
  67. 67.
    Wang, Z., Tu, L., Guo, Z., Yang, L.T., Huang, B.: Analysis of user behaviors by mining large network data sets. Future Gener. Comput. Syst. 37, 429–437 (2014)CrossRefGoogle Scholar
  68. 68.
    Ward, J.J., McGuffin, L.J., Bryson, K., Buxton, B.F., Jones, D.T.: The DISOPRED server for the prediction of protein disorder. Bioinformatics 20(13), 2138–2139 (2004). Scholar
  69. 69.
    Xu, Z., Mei, L., Hu, C., Liu, Y.: The big data analytics and applications of the surveillance system using video structured description technology. Clust. Comput. 19(3), 1283–1292 (2016). Scholar
  70. 70.
    Xue, B., Dunbrack, R.L., Williams, R.W., Dunker, A.K., Uversky, V.N.: Pondr-fit: a meta-predictor of intrinsically disordered amino acids. Biochim. Biophys. Acta (BBA) - Proteins Proteomics 1804(4), 996–1010 (2010). Scholar
  71. 71.
    Yang, C.T., Chen, S.T., Yan, Y.Z.: The implementation of a cloud city traffic state assessment system using a novel big data architecture. Clust. Comput. 20(2), 1101–1121 (2017). Scholar
  72. 72.
    Yang, Z.R., Thomson, R., McNeil, P., Esnouf, R.M.: RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics 21(16), 3369–3376 (2005). Scholar
  73. 73.
    Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J., Ghodsi, A., Gonzalez, J., Shenker, S., Stoica, I.: Apache Spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016). Scholar
  74. 74.
    Zhang, T., Faraggi, E., Li, Z., Zhou, Y.: Intrinsic disorder and Semi-disorder prediction by SPINE-D, pp. 159–174. Springer, New York (2017). Scholar
  75. 75.
    Zhong, Y., Zhang, L., Xing, S., Li, F., Wan, B.: The Big Data processing algorithm for water environment monitoring of the three gorges reservoir area. Abstr. Appl. Anal. 2014 (2014)Google Scholar
  76. 76.
    Zou, Q., Hu, Q., Guo, M., Wang, G.: HAlign: fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. Bioinformatics 31(15), 2475–2481 (2015)CrossRefGoogle Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. 1.Silesian University of TechnologyGliwicePoland

Personalised recommendations