Scalable Prediction of Intrinsically Disordered Protein Regions with Spark Clusters on Microsoft Azure Cloud

Part of the book series: Computational Biology (COBO, volume 28)

Abstract

Intrinsically disordered proteins (IDPs) are a broad class of molecules that act in the cells of living organisms and mediate many protein–protein interactions and regulatory processes. Computational identification of disordered regions in protein amino acid sequences has therefore become an important branch of 3D protein structure prediction and modeling. This chapter presents an IDP meta-predictor that applies an ensemble of primary predictors to increase the quality of IDP prediction. It also presents a highly scalable implementation of the meta-predictor on a Spark cluster (Spark-IDPP), which mitigates the problem of the exponentially growing number of protein amino acid sequences in public repositories. Spark-IDPP addresses the current needs of IDP prediction by parallelizing computations on a Spark cluster that can be scaled on demand on the Microsoft Azure cloud to match particular requirements for computing power.
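
The idea of a meta-predictor built from an ensemble of primary predictors can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the three toy predictors, their residue sets, the strict majority-vote rule, and the names `meta_predict` and `predict_batch` are hypothetical and are not taken from the chapter, which does not state in this abstract how the ensemble is combined.

```python
from typing import Callable, Dict, List

# A primary predictor maps a sequence to one boolean per residue
# (True = predicted disordered). These toy rules are hypothetical stand-ins
# for real disorder predictors and carry no biological authority.
Predictor = Callable[[str], List[bool]]


def charged(seq: str) -> List[bool]:
    return [aa in "DEKR" for aa in seq]


def flexible(seq: str) -> List[bool]:
    return [aa in "PGSEKQ" for aa in seq]


def non_hydrophobic(seq: str) -> List[bool]:
    return [aa not in "AILMFVWC" for aa in seq]


PREDICTORS: List[Predictor] = [charged, flexible, non_hydrophobic]


def meta_predict(seq: str, predictors: List[Predictor]) -> List[bool]:
    """Combine per-residue predictions by strict majority vote (an assumption)."""
    per_predictor = [p(seq) for p in predictors]
    n = len(predictors)
    # zip(*...) groups the votes cast for the same residue position.
    return [sum(votes) * 2 > n for votes in zip(*per_predictor)]


def predict_batch(sequences: Dict[str, str],
                  predictors: List[Predictor]) -> Dict[str, List[bool]]:
    """Score each sequence independently of all the others."""
    return {pid: meta_predict(seq, predictors) for pid, seq in sequences.items()}


if __name__ == "__main__":
    print(predict_batch({"toy_protein": "MKPEEA"}, PREDICTORS))
```

Because each sequence is scored independently, the loop in `predict_batch` is the kind of step that could be distributed across cluster workers, for example with a `map` over a Spark RDD of sequences; that mapping is a design assumption here, not a description of how Spark-IDPP is implemented.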

Big data is mostly about taking numbers and using those numbers to make predictions about the future. The bigger the data set you have, the more accurate the predictions about the future will be

Anthony Goldbloom

Author information

Corresponding author

Correspondence to Dariusz Mrozek.

Copyright information

© 2018 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Mrozek, D. (2018). Scalable Prediction of Intrinsically Disordered Protein Regions with Spark Clusters on Microsoft Azure Cloud. In: Scalable Big Data Analytics for Protein Bioinformatics. Computational Biology, vol 28. Springer, Cham. https://doi.org/10.1007/978-3-319-98839-9_9

  • DOI: https://doi.org/10.1007/978-3-319-98839-9_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-98838-2

  • Online ISBN: 978-3-319-98839-9

  • eBook Packages: Computer Science, Computer Science (R0)
