Efficient 3D Protein Structure Alignment on Large Hadoop Clusters in Microsoft Azure Cloud

  • Bożena Małysiak-Mrozek
  • Paweł Daniłowicz
  • Dariusz Mrozek (corresponding author)
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 928)

Abstract

Exploring 3D protein structures has broad potential applications in medical diagnostics, drug design, and the treatment of patients. 3D protein structure similarity searching is one of the important exploration processes in structural bioinformatics. However, the process is time-consuming and requires substantial computational resources when performed against large repositories. In this paper, we show that 3D protein structure similarity searching can be significantly accelerated by using modern processing techniques and computer architectures. The results of our experiments show that by distributing computations on large Hadoop/HBase (HDInsight) clusters and scaling them out and up in the Microsoft Azure public cloud, we can reduce the execution times of similarity search processes from hundreds of hours to minutes. We also show that using public clouds for scientific computations is beneficial and can be applied successfully when scaling time-consuming computations over large volumes of biological data.

Keywords

Proteins; 3D protein structure; Structural bioinformatics; Structural alignment; Similarity searching; Hadoop; MapReduce; Cloud computing; Microsoft Azure

Notes

Acknowledgments

This work was supported by Microsoft Research within the Microsoft Azure for Research Award grant, and by Statutory Research funds of the Institute of Informatics, Silesian University of Technology, Gliwice, Poland (grant No. BK/213/RAU2/2018).

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Bożena Małysiak-Mrozek (1)
  • Paweł Daniłowicz (1)
  • Dariusz Mrozek (1), corresponding author
  1. Institute of Informatics, Silesian University of Technology, Gliwice, Poland