Scalable Prediction of Intrinsically Disordered Protein Regions with Spark Clusters on Microsoft Azure Cloud

Mrozek, Dariusz

doi:10.1007/978-3-319-98839-9_9

Scalable Prediction of Intrinsically Disordered Protein Regions with Spark Clusters on Microsoft Azure Cloud

Dariusz Mrozek ORCID: orcid.org/0000-0001-6764-6656²⁴

Chapter
First Online: 26 September 2018

826 Accesses
1 Citations

Part of the book series: Computational Biology ((COBO,volume 28))

Abstract

Intrinsically disordered proteins (IDPs) constitute a wide range of molecules that act in cells of living organisms and mediate many protein–protein interactions and many regulatory processes. Computational identification of disordered regions in protein amino acid sequences, thus, became an important branch of 3D protein structure prediction and modeling. In this chapter, we will see the IDP meta-predictor that applies an ensemble of primary predictors in order to increase the quality of IDP prediction. We will also see the highly scalable implementation of the meta-predictor on the Spark cluster (Spark-IDPP) that mitigates the problem of the exponentially growing number of protein amino acid sequences in public repositories. Spark-IDPP responds very well to the current needs of IDP prediction by parallelizing computations on the Spark cluster that can be scaled on demand on the Microsoft Azure cloud according to particular requirements for computing power.

Big data is mostly about taking numbers and using those numbers to make predictions about the future. The bigger the data set you have, the more accurate the predictions about the future will be

Anthony Goldbloom

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Hardcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

References

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402 (1997). https://doi.org/10.1093/nar/25.17.3389
Article Google Scholar
Bai, C., Dhavale, D., Sarkis, J.: Complex investment decisions using rough set and fuzzy c-means: an example of investment in green supply chains. Eur. J. Oper. Res. 248(2), 507–521 (2016)
Article MathSciNet MATH Google Scholar
Baron, T.: Prediction of intrinsically disordered proteins in Apache Spark. Master’s thesis, Institute of Informatics, Silesian University of Technology, Gliwice, Poland (2016)
Google Scholar
Bayer, P., Arndt, A., Metzger, S., Mahajan, R., Melchior, F., Jaenicke, R., Becker, J.: Structure determination of the small ubiquitin-related modifier SUMO-1. J. Mol. Biol. 280(2), 275–286 (1998). http://www.sciencedirect.com/science/article/pii/S0022283698918393
Article Google Scholar
Benson, D.A., Cavanaugh, M., Clark, K., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Sayers, E.W.: GenBank. Nucleic Acids Res. 45(D1), D37–D42 (2017). https://doi.org/10.1093/nar/gkw1070
Article Google Scholar
Berman, H., et al.: The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000)
Article Google Scholar
Boutet, E., Lieberherr, D., Tognolli, M., Schneider, M., Bansal, P., Bridge, A.J., Poux, S., Bougueleret, L., Xenarios, I.: UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View, pp. 23–54. Springer, New York (2016)
Chapter Google Scholar
Ceri, S., Kaitoua, A., Masseroli, M., Pinoli, P., Venco, F.: Data management for heterogeneous genomic datasets. IEEE/ACM Trans. Comput. Biol. Bioinform. 99, 1–1 (2016)
Google Scholar
Chang, H., Mishra, N., Lin, C.: IoT Big-Data centred knowledge granule analytic and cluster framework for BI applications: a case base analysis. Plos One 10, 1–23 (2015)
Google Scholar
Cheng, J., Sweredoski, M.J., Baldi, P.: Accurate prediction of protein disordered regions by mining protein structure data. Data Min. Knowl. Discov. 11(3), 213–222 (2005), https://doi.org/10.1007/s10618-005-0001-y
Article MathSciNet Google Scholar
Cupek, R., Ziebinski, A., Huczala, L., Erdogan, H.: Agent-based manufacturing execution systems for short-series production scheduling. Comput. Ind. 82, 245–258 (2016)
Article Google Scholar
Czerniak, J.M., Dobrosielski, W.T., Apiecionek, Ł., Ewald, D.: Representation of a trend in OFN during fuzzy observance of the water level from the Crisis control center. In: Proceedings of the 2015 Federated Conference on Computer Science and Information Systems (FedCSIS), pp. 443–447 (2015)
Google Scholar
Davis, G.B., Carley, K.M.: Clearing the fog: fuzzy, overlapping groups for social networks. Soc. Netw. 30(3), 201–212 (2008)
Article Google Scholar
De Maio, C., Fenza, G., Loia, V., Parente, M.: Time aware knowledge extraction for microblog summarization on Twitter. Inf. Fus. 28, 60–74 (2016)
Article Google Scholar
Dosztányi, Z., Csizmok, V., Tompa, P., Simon, I.: IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21(16), 3433–3434 (2005). https://doi.org/10.1093/bioinformatics/bti541
Article Google Scholar
Dunker, A.K., Silman, I., Uversky, V.N., Sussman, J.L.: Function and structure of inherently disordered proteins. Curr. Opin. Struct. Biol. 18(6), 756–764 (2008)
Article Google Scholar
Feng, X., Grossman, R., Stein, L.: PeakRanger: a cloud-enabled peak caller for ChIP-seq data. BMC Bioinform.12(1), 1–11 (2011), https://doi.org/10.1186/1471-2105-12-139
Article Google Scholar
Guo, K., Zhang, R., Kuang, L.: TMR: towards an efficient semantic-based heterogeneous transportation media Big Data retrieval. Neurocomputing 181, 122–131 (2016)
Article Google Scholar
Hazelhurst, S.: PH2: an Hadoop-based framework for mining structural properties from the PDB database. In: Proceedings of the 2010 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists, pp. 104–112 (2010)
Google Scholar
Hirose, S., Shimizu, K., Kanai, S., Kuroda, Y., Noguchi, T.: POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions. Bioinformatics 23(16), 2046–2053 (2007). https://doi.org/10.1093/bioinformatics/btm302
Article Google Scholar
Hu, C., Ren, G., Liu, C., Li, M., Jie, W.: A Spark-based genetic algorithm for sensor placement in large scale drinking water distribution systems. Clust. Comput. 20(2), 1089–1099 (2017). https://doi.org/10.1007/s10586-017-0838-z
Article Google Scholar
Hung, C.L., Hua, G.J.: Cloud Computing for protein-ligand binding site comparison. Biomed Res. Int. 170356 (2013)
Google Scholar
Hung, C.L., Lin, C.Y.: Open reading frame phylogenetic analysis on the cloud. Int. J. Genomics 2013(614923), 1–9 (2013)
Google Scholar
Hung, C.L., Lin, Y.L.: Implementation of a parallel protein structure alignment service on cloud. Int. J. Genomics 439681, 1–8 (2013)
Google Scholar
Ishida, T., Kinoshita, K.: PrDOS: prediction of disordered protein regions from amino acid sequence. Nucleic Acids Res. 35(suppl\(\_\)2), W460–W464 (2007). https://doi.org/10.1093/nar/gkm363
Article Google Scholar
Jensen, K., Nguyen, H.T., Do, T.V., Årnes, A.: a big data analytics approach to combat telecommunication vulnerabilities. Clust. Comput. 20(3), 2363–2374 (2017). https://doi.org/10.1007/s10586-017-0811-x
Article Google Scholar
Jin, Y., Dunbrack, R.: Assessment of disorder predictions in CASP6. Proteins 61, 167–175 (2005)
Article Google Scholar
Kabsch, W., Sander, C.: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12), 2577–2637 (1987)
Article Google Scholar
Kelley, D.R., Schatz, M.C., Salzberg, S.L.: Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 11(11), 1–13 (2010). https://doi.org/10.1186/gb-2010-11-11-r116
Article Google Scholar
Kozlowski, L.P., Bujnicki, J.M.: MetaDisorder: a meta-server for the prediction of intrinsic disorder in proteins. BMC Bioinform. 13(1), 111 (2012). https://doi.org/10.1186/1471-2105-13-111
Article Google Scholar
Langmead, B., Hansen, K.D., Leek, J.T.: Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol. 11(8), 1–11 (2010). https://doi.org/10.1186/gb-2010-11-8-r83
Article Google Scholar
Langmead, B., Schatz, M.C., Lin, J., Pop, M., Salzberg, S.L.: Searching for SNPs with Cloud computing. Genome Biol. 10(11), 1–10 (2009). https://doi.org/10.1186/gb-2009-10-11-r134
Article Google Scholar
Lewis, S., Csordas, A., Killcoyne, S., Hermjakob, H., et al.: Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework. BMC Bioinform. 13, 324 (2012)
Article Google Scholar
Linding, R., Jensen, L.J., Diella, F., Bork, P., Gibson, T.J., Russell, R.B.: Protein disorder prediction: implications for structural proteomics. Structure 11(11), 1453–1459 (2003). http://www.sciencedirect.com/science/article/pii/S0969212603002351
Article Google Scholar
Linding, R., Russell, R.B., Neduva, V., Gibson, T.J.: GlobPlot: exploring protein sequences for globularity and disorder. Nucleic Acids Res. 31(13), 3701–3708 (2003). https://doi.org/10.1093/nar/gkg519
Article Google Scholar
Lipman, D., Pearson, W.: Rapid and sensitive protein similarity searches. Science 227(4693), 1435–1441 (1985)
Article Google Scholar
Lu, H., Sun, Z., Qu, W.: Big Data-driven based real-time traffic flow state identification and prediction. Discret. Dyn. Nat. Soc. 2015, 1–11 (2015)
MathSciNet Google Scholar
Lu, H., Sun, Z., Qu, W., Wang, L.: Real-time corrected traffic correlation model for traffic flow forecasting. Math. Probl. Eng. 2015, 1–7 (2015)
Google Scholar
Mahmud, S., Iqbal, R., Doctor, F.: Cloud enabled data analytics and visualization framework for health-shocks prediction. Future Gener. Comput. Syst. 65, 169–181 (2016). http://www.sciencedirect.com/science/article/pii/S0167739X15003271. (special Issue on Big Data in the Cloud)
Article Google Scholar
Małysiak-Mrozek, B., Baron, T., Mrozek, D.: Spark-IDPP: High throughput and scalable prediction of intrinsically disordered protein regions with Spark clusters on the Cloud, J. Clus. Comp, 1–35 (in review)
Google Scholar
Małysiak-Mrozek, B., Stabla, M., Mrozek, D.: Soft and declarative fishing of information in Big Data lake. IEEE Trans. Fuzzy Syst. 99, 1–1 (2018)
Google Scholar
Małysiak-Mrozek, B., Zur, K., Mrozek, D.: In-memory management system for 3D protein macromolecular structures. Curr. Proteomics 15 (2018). https://doi.org/10.2174/1570164615666180320151452
Article Google Scholar
Matsunaga, A., Tsugawa, M., Fortes, J.: Cloudblast: combining MapReduce and virtualization on distributed resources for bioinformatics applications. In: Proceedings of the IEEE Fourth International Conference on eScience (ESCIENCE ’08), pp. 222–229 (2008)
Google Scholar
Matthews, S.J., Williams, T.L.: MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees. BMC Bioinform. 11(1), 1–9 (2010). https://doi.org/10.1186/1471-2105-11-S1-S15
Article Google Scholar
Meng, L., Tan, A., Wunsch, D.: Adaptive scaling of cluster boundaries for large-scale social media data clustering. IEEE Trans. Neural Netw. Learn. 27(12), 2656–2669 (2015)
Article Google Scholar
Mrozek, D.: High-Performance Computational Solutions in Protein Bioinformatics. SpringerBriefs in Computer Science. Springer International Publishing, Cham (2014)
Book Google Scholar
Mrozek, D., Daniłowicz, P., Małysiak-Mrozek, B.: HDInsight4PSi: boosting performance of 3D protein structure similarity searching with HDInsight clusters in Microsoft Azure cloud. Inf. Sci. 349–350, 77–101 (2016)
Article Google Scholar
Mrozek, D., Gosk, P., Małysiak-Mrozek, B.: Scaling Ab Initio predictions of 3D protein structures in Microsoft Azure cloud. J Grid Comput. 13, 561–585 (2015)
Article Google Scholar
Mrozek, D., Kutyła, T., Małysiak-Mrozek, B.: Accelerating 3D protein structure similarity searching on Microsoft Azure Cloud with local replicas of macromolecular data. In: Wyrzykowski, R. (ed.) Parallel Processing and Applied Mathematics - PPAM 2015. Lecture Notes in Computer Science, vol. 9574, pp. 1–12. Springer, Heidelberg (2016)
Google Scholar
Mrozek, D., Małysiak-Mrozek, B., Kłapciński, A.: Cloud4Psi: cloud computing for 3D protein structure similarity searching. Bioinformatics 30(19), 2822–2825 (2014)
Article Google Scholar
Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kozielski, S.: Life sciences data analysis. Inform. Sci. 384, 86–89 (2017)
Article MATH Google Scholar
Piovesan, D., Tabaro, F., Mičetić, I., Necci, M., Quaglia, F., Oldfield, C.J., Aspromonte, M.C., Davey, N.E., Davidović, R., Dosztányi, Z., Elofsson, A., Gasparini, A., Hatos, A., Kajava, A.V., Kalmar, L., Leonardi, E., Lazar, T., Macedo-Ribeiro, S., Macossay-Castillo, M., Meszaros, A., Minervini, G., Murvai, N., Pujols, J., Roche, D.B., Salladini, E., Schad, E., Schramm, A., Szabo, B., Tantos, A., Tonello, F., Tsirigos, K.D., Veljković, N., Ventura, S., Vranken, W., Warholm, P., Uversky, V.N., Dunker, A.K., Longhi, S., Tompa, P., Tosatto, S.C.: DisProt 7.0: a major update of the database of disordered proteins. Nucleic Acids Res. 45(D1), D219–D227 (2017). https://doi.org/10.1093/nar/gkw1056
Article Google Scholar
Powers, D.: Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Int. J. Mach. Learn. Technol. 2, 37–63 (2011)
Article Google Scholar
Qiu, X., Ekanayake, J., Beason, S., Gunarathne, T., Fox, G., Barga, R., Gannon, D.: Cloud technologies for bioinformatics applications. In: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers, pp. 6:1–6:10. MTAGS ’09, ACM, New York, NY, USA (2009). https://doi.org/10.1145/1646468.1646474
Radenski, A., Ehwerhemuepha, L.: Speeding-up codon analysis on the cloud with local MapReduce aggregation. Inf. Sci. 263, 175–185 (2014)
Article Google Scholar
Rose, A.S., Hildebrand, P.W.: NGL viewer: a web application for molecular visualization. Nucleic Acids Res. 43(W1), W576–W579 (2015). https://doi.org/10.1093/nar/gkv402
Article Google Scholar
Schatz, M.C.: CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 25(11), 1363–1369 (2009)
Article Google Scholar
Shimizu, K., Hirose, S., Noguchi, T.: POODLE-S: web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix. Bioinformatics 23(17), 2337–2338 (2007). https://doi.org/10.1093/bioinformatics/btm330
Article Google Scholar
Sickmeier, M., Hamilton, J.A., LeGall, T., Vacic, V., Cortese, M.S., Tantos, A., Szabo, B., Tompa, P., Chen, J., Uversky, V.N., Obradovic, Z., Dunker, A.K.: DisProt: the database of disordered proteins. Nucleic Acids Res. 35\((\text{suppl}\_1)\), D786–D793 (2007). https://doi.org/10.1093/nar/gkl893
Article Google Scholar
Su, C.T., Chen, C.Y., Hsu, C.M.: iPDA: integrated protein disorder analyzer. Nucleic Acids Res. 35(suppl\({\_}\)2), W465–W472 (2007). https://doi.org/10.1093/nar/gkm353
Article Google Scholar
Teijeiro, D., Pardo, X.C., Penas, D.R., González, P., Banga, J.R., Doallo, R.: A cloud-based enhanced differential evolution algorithm for parameter estimation problems in computational systems biology. Clust. Comput. 20(3), 1937–1950 (2017). https://doi.org/10.1007/s10586-017-0860-1
Article Google Scholar
The UniProt consortium: Uniprot: the universal protein knowledgebase. Nucleic Acids Res. 45(D1), D158–D169 (2017). https://doi.org/10.1093/nar/gkw1099
Tripathy, B.K., Mittal, D.: Hadoop based uncertain possibilistic kernelized c-means algorithms for image segmentation and a comparative analysis. Appl. Soft Comput. 46, 886–923 (2016)
Article Google Scholar
Vullo, A., Bortolami, O., Pollastri, G., Tosatto, S.C.E.: Spritz: a server for the prediction of intrinsically disordered regions in protein sequences using kernel machines. Nucleic Acids Res. 34\((\text{ suppl }\_2)\), W164–W168 (2006). https://doi.org/10.1093/nar/gkl166
Article Google Scholar
Wang, H., Li, J., Hou, Z., Fang, R., Mei, W., Huang, J.: Research on parallelized real-time map matching algorithm for massive GPS data. Clust. Comput. 20(2), 1123–1134 (2017). https://doi.org/10.1007/s10586-017-0869-5
Article Google Scholar
Wang, C., Li, X., Zhou, X., Wang, A., Nedjah, N.: Soft computing in Big Data intelligent transportation systems. Appl. Soft Comput. 38, 1099–1108 (2016)
Article Google Scholar
Wang, Z., Tu, L., Guo, Z., Yang, L.T., Huang, B.: Analysis of user behaviors by mining large network data sets. Future Gener. Comput. Syst. 37, 429–437 (2014)
Article Google Scholar
Ward, J.J., McGuffin, L.J., Bryson, K., Buxton, B.F., Jones, D.T.: The DISOPRED server for the prediction of protein disorder. Bioinformatics 20(13), 2138–2139 (2004). https://doi.org/10.1093/bioinformatics/bth195
Article Google Scholar
Xu, Z., Mei, L., Hu, C., Liu, Y.: The big data analytics and applications of the surveillance system using video structured description technology. Clust. Comput. 19(3), 1283–1292 (2016). https://doi.org/10.1007/s10586-016-0581-x
Article Google Scholar
Xue, B., Dunbrack, R.L., Williams, R.W., Dunker, A.K., Uversky, V.N.: Pondr-fit: a meta-predictor of intrinsically disordered amino acids. Biochim. Biophys. Acta (BBA) - Proteins Proteomics 1804(4), 996–1010 (2010). http://www.sciencedirect.com/science/article/pii/S1570963910000130
Article Google Scholar
Yang, C.T., Chen, S.T., Yan, Y.Z.: The implementation of a cloud city traffic state assessment system using a novel big data architecture. Clust. Comput. 20(2), 1101–1121 (2017). https://doi.org/10.1007/s10586-017-0846-z
Article Google Scholar
Yang, Z.R., Thomson, R., McNeil, P., Esnouf, R.M.: RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics 21(16), 3369–3376 (2005). https://doi.org/10.1093/bioinformatics/bti534
Article Google Scholar
Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J., Ghodsi, A., Gonzalez, J., Shenker, S., Stoica, I.: Apache Spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016). https://doi.org/10.1145/2934664
Article Google Scholar
Zhang, T., Faraggi, E., Li, Z., Zhou, Y.: Intrinsic disorder and Semi-disorder prediction by SPINE-D, pp. 159–174. Springer, New York (2017). https://doi.org/10.1007/978-1-4939-6406-2_12
Google Scholar
Zhong, Y., Zhang, L., Xing, S., Li, F., Wan, B.: The Big Data processing algorithm for water environment monitoring of the three gorges reservoir area. Abstr. Appl. Anal. 2014 (2014)
Google Scholar
Zou, Q., Hu, Q., Guo, M., Wang, G.: HAlign: fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. Bioinformatics 31(15), 2475–2481 (2015)
Article Google Scholar

Download references

Author information

Authors and Affiliations

Silesian University of Technology, Gliwice, Poland
Dariusz Mrozek

Authors

Dariusz Mrozek
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Dariusz Mrozek .

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Mrozek, D. (2018). Scalable Prediction of Intrinsically Disordered Protein Regions with Spark Clusters on Microsoft Azure Cloud. In: Scalable Big Data Analytics for Protein Bioinformatics. Computational Biology, vol 28. Springer, Cham. https://doi.org/10.1007/978-3-319-98839-9_9

Download citation

DOI: https://doi.org/10.1007/978-3-319-98839-9_9
Published: 26 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98838-2
Online ISBN: 978-3-319-98839-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics