Case Study of Scientific Data Processing on a Cloud Using Hadoop

  • Chen Zhang
  • Hans De Sterck
  • Ashraf Aboulnaga
  • Haig Djambazian
  • Rob Sladek
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5976)


With the increasing popularity of cloud computing, Hadoop has become a widely used open-source cloud computing framework for large-scale data processing. However, few efforts have been made to demonstrate its applicability to real-world application scenarios outside server-side computations such as web indexing. In this paper, we use the Hadoop cloud computing framework to develop a user application that processes scientific data on clouds. We describe a simple extension to Hadoop’s MapReduce that allows it to handle scientific data processing problems with arbitrary input formats and explicit control over how the input is split. We use this approach to develop a Hadoop-based cloud computing application that processes sequences of microscope images of live cells, and we test its performance. We also discuss how the approach can be generalized to more complicated scientific data processing problems.


Cloud Computing, Master Node, Slave Node, Cloud Application, Hadoop Distributed File System
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Chen Zhang (1)
  • Hans De Sterck (2)
  • Ashraf Aboulnaga (1)
  • Haig Djambazian (3)
  • Rob Sladek (3)

  1. David R. Cheriton School of Computer Science, University of Waterloo, Canada
  2. Department of Applied Mathematics, University of Waterloo, Canada
  3. McGill University and Genome Quebec Innovation Centre, Montreal, Canada
