Optimized cloud-based scheduling for protein secondary structure analysis

Abstract

In proteomics, in-depth analysis of the 3D structure of a protein is of paramount importance for many biological studies and applications. At the secondary level, protein structure can be described in terms of motifs: recurrent patterns of smaller biological structures called secondary structure elements. This paper focuses on identifying geometrical motifs in different proteins using the Cross Motif Search (CMS) algorithm. Because CMS has a high computational cost relative to traditional alignment algorithms, this task is very demanding and parallel processing is mandatory. Previous papers have already studied CMS parallelization from the HPC standpoint. Since cloud computing is emerging as an alternative to on-premise HPC systems, it is worthwhile to examine the feasibility and possible advantages, in terms of both performance and cost, of migrating to a cloud implementation. This paper extends a preliminary work on the cloud parallelization of CMS and makes two main contributions. First, it describes an analytic model of the communication pattern of CMS, in order to gain insight into the performance of the application when executed on a cloud infrastructure. Second, it introduces an optimized “location-aware” scheduling policy for assigning workload to the application workers, in order to minimize internode communication in a cloud setting. Experiments are presented to validate the newly introduced scheduling policy and to assess the performance of the cloud implementation of CMS. The results are general, in the sense that they can be applied to any other algorithm whose communication pattern is similar to that of the target application.
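The abstract does not spell out how the “location-aware” policy works, so the following Python sketch only illustrates the general idea of trading load balance against internode communication. It is not the authors' algorithm. The function name `location_aware_assign`, the notion of a task's "home node", and the `max_skew` threshold are all assumptions introduced for this example.

```python
from collections import defaultdict


def location_aware_assign(tasks, workers, max_skew=1):
    """Greedily map tasks to workers, preferring workers on the task's home node.

    tasks    -- list of (task_id, home_node) pairs (hypothetical format)
    workers  -- dict mapping worker_id -> node hosting that worker
    max_skew -- how many extra tasks a local worker may hold, compared with
                the least-loaded worker overall, before locality is given up
    Returns (assignment, remote): assignment maps worker_id -> list of task
    ids, and remote counts the tasks placed off their home node.
    """
    load = {w: 0 for w in workers}

    # Group workers by the node they run on.
    workers_on = defaultdict(list)
    for w, node in workers.items():
        workers_on[node].append(w)

    assignment = {w: [] for w in workers}
    remote = 0
    for task_id, home in tasks:
        # Default choice: the least-loaded worker anywhere.
        least_loaded = min(load, key=load.get)
        chosen = least_loaded

        # Prefer a worker on the home node if it is not too far behind.
        local = workers_on.get(home, [])
        if local:
            best_local = min(local, key=load.get)
            if load[best_local] - load[least_loaded] <= max_skew:
                chosen = best_local

        if workers[chosen] != home:
            remote += 1  # this task's data would cross node boundaries
        assignment[chosen].append(task_id)
        load[chosen] += 1
    return assignment, remote
```

In this sketch a larger `max_skew` keeps more work local at the price of a less even load, while `max_skew=0` falls back to plain least-loaded assignment whenever a local worker is busier than a remote one.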

Change history

  • 17 June 2019

    Mirto Musci was not listed among the authors. The original article has been corrected.

Notes

  1. https://vision.unipv.it/bioinformatics/contents/tools.php.

  2. Note that 1k32 and 1bgl share identical chains; using a priori biological information would greatly reduce the computational time. As stated before, however, CMS only focuses on geometrical information.

Author information

Correspondence to Luigi Santangelo.

About this article

Cite this article

Ferretti, M., Santangelo, L. & Musci, M. Optimized cloud-based scheduling for protein secondary structure analysis. J Supercomput 75, 3499–3520 (2019). https://doi.org/10.1007/s11227-019-02859-w

Keywords

  • Proteomics
  • Cloud computing
  • HPC
  • Cross Motif Search
  • CINECA
  • Google Cloud
  • pLogP