A Dockerized String Analysis Workflow for Big Data

Part of the Communications in Computer and Information Science book series (CCIS, volume 1064)

Abstract

Many scientific fields are moving into the Big Data era, producing large volumes of data that must be processed to extract new knowledge. Scientific workflows are often the key tools for solving problems characterized by computational complexity and data diversity, and cloud computing can facilitate their efficient execution. In this paper, we present a generative big data analysis workflow that provides analytics, clustering, prediction and visualization services for datasets from various scientific fields by transforming the input data into strings. The workflow consists of novel algorithms for data processing and relationship discovery that are scalable and suitable for cloud infrastructures. Domain experts can interact with the workflow components, set their parameters, run personalized pipelines and obtain support for decision-making. Two case studies, using datasets of (i) documents and (ii) gene sequence data, show promising results in terms of efficiency and performance.
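The abstract does not state how input data are turned into strings, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes each item is a vector of values in [0, 1] (for example, hypothetical per-topic weights of a document), maps each value to a letter by equal-width binning, and compares the resulting equal-length strings with a Hamming distance. The function names, the bin count and the choice of distance are all assumptions made for illustration.

```python
from string import ascii_uppercase


def encode_as_string(values, n_bins=10):
    """Map each value in [0, 1] to a letter: bin 0 -> 'A', bin 1 -> 'B', ...

    Hypothetical encoding chosen for illustration only.
    """
    if not 1 <= n_bins <= len(ascii_uppercase):
        raise ValueError("n_bins must be between 1 and 26")
    letters = []
    for v in values:
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"value {v} is outside [0, 1]")
        # The top edge (v == 1.0) is placed in the last bin.
        idx = min(int(v * n_bins), n_bins - 1)
        letters.append(ascii_uppercase[idx])
    return "".join(letters)


def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Two hypothetical items described by three values each.
item_a = encode_as_string([0.05, 0.92, 0.03])
item_b = encode_as_string([0.15, 0.82, 0.03])
print(item_a, item_b, hamming_distance(item_a, item_b))
```

Once every item is a string of the same length, string-oriented tools (per-position letter frequencies, sequence distances, hierarchical clustering) can be applied uniformly to documents and gene sequences alike, which is the kind of reuse the abstract describes.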

Keywords

  • Workflow
  • Docker
  • Big data analytics
  • String analysis

  • DOI: 10.1007/978-3-030-30278-8_55
  • Chapter length: 6 pages

Notes

  1. https://github.com/mariakotouza/ARGP-Tool/wiki/Antigen-receptor-gene-profiler-(ARGP)

  2. http://www.imgt.org

  3. https://www.rdocumentation.org/packages/cluster/versions/2.0.7-1/topics/diana

Author information

Corresponding author

Correspondence to Maria Th. Kotouza.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Kotouza, M.T., Psomopoulos, F.E., Mitkas, P.A. (2019). A Dockerized String Analysis Workflow for Big Data. In: , et al. New Trends in Databases and Information Systems. ADBIS 2019. Communications in Computer and Information Science, vol 1064. Springer, Cham. https://doi.org/10.1007/978-3-030-30278-8_55

  • DOI: https://doi.org/10.1007/978-3-030-30278-8_55

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30277-1

  • Online ISBN: 978-3-030-30278-8

  • eBook Packages: Computer Science, Computer Science (R0)