Big Data Operations: Basis for Benchmarking a Data Grid

  • Arcot Rajasekar
  • Reagan Moore
  • Shu Huang
  • Yufeng Xin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8585)


Data operations over wide-area networks are complex, and their end-to-end implementations vary significantly in efficiency, failure recovery, and transactional management. Benchmarking these operations is vital given the exponential growth in data size. Critically evaluating the types of data operations performed within large-scale data management systems, and comparing their efficiency across implementations, is an appropriate topic for benchmarking in a big data framework. In this paper, we identify the operations that are important in large-scale data management and discuss a few of them in terms of data grid benchmarking. These operations form a set of core abstractions that can define how domain-centric scientific or business workflow applications interact with big data systems. We chose these operational abstractions based on our experience with large-scale distributed systems and data-intensive computation.


Benchmarking · Data grid · iRODS · Data operations · Optimization



We acknowledge the funding by NSF grant #1247652 “BIGDATA: Mid-Scale: ESCE: DCM: Collaborative Research: DataBridge - A Sociometric System for Long tail Science Data Collections”, by NSF grant #0940841 “DataNet Federation Consortium” and by NSF grant #1032732 “SDCI Data Improvement: Improvement and Sustainability of iRODS Data Grid Software for Multi-Disciplinary Community Driven Application”.



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Arcot Rajasekar (1)
  • Reagan Moore (1)
  • Shu Huang (1)
  • Yufeng Xin (1)
  1. The University of North Carolina at Chapel Hill, Chapel Hill, USA
