Investigating Read Performance of Python and NetCDF When Using HPC Parallel Filesystems

  • Matthew Jones
  • Jon Blower
  • Bryan Lawrence
  • Annette Osprey
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9945)

Abstract

New methods need to be developed to handle the increasing size of data sets in atmospheric science: traditional analysis scripts often read and process the data inefficiently. NetCDF4 is a common file format in the atmospheric and ocean sciences, and Python is widely used for data analysis in these fields. The aim of this work is to provide insight into which read patterns and sizes are most effective when using the netCDF4-python library. Quantitative information on this would be useful to scientists, library developers, and data managers.
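
For context, reads through netCDF4-python are expressed as array slices on a variable object, and the slice determines the read pattern and size that reach the file system. A minimal sketch follows; the file name "data.nc" and variable name "temperature" are hypothetical placeholders, not taken from the paper:

```python
from netCDF4 import Dataset

# Open a NetCDF4 file read-only. The variable handle is lazy:
# no data is read from disk until the variable is sliced.
with Dataset("data.nc", "r") as nc:
    var = nc.variables["temperature"]
    block = var[0:1024]          # this slice triggers the actual read
    print(block.shape, block.dtype)
```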

Three read patterns, sequential, strided, and random, were compared to simulate different types of access; each was tested on three parallel file systems: Panasas, Lustre, and GPFS. Read rate and its standard deviation were measured in both Python and C, reading from plain binary files and from NetCDF4 files. The read performance of netCDF4-python was then compared with that of native Python, the C NetCDF library, and the C POSIX interface.
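
As a sketch of what such a benchmark looks like for the plain-binary case (this is not the authors' harness; the file name, stride factor, and read counts are illustrative assumptions, and the file is assumed large enough for all reads):

```python
import os
import random
import time

def read_rate_mib_s(path, read_size, pattern, n_reads=1000):
    """Time n_reads reads of read_size bytes from path; return MiB/s."""
    file_size = os.path.getsize(path)
    stride = 4 * read_size                   # arbitrary gap for the strided case
    with open(path, "rb") as f:
        start = time.time()
        for i in range(n_reads):
            if pattern == "strided":
                f.seek((i * stride) % (file_size - read_size))
            elif pattern == "random":
                f.seek(random.randrange(file_size - read_size))
            # "sequential": no seek, the file offset advances by itself
            f.read(read_size)
        elapsed = time.time() - start
    return n_reads * read_size / (1024 ** 2) / elapsed

for pattern in ("sequential", "strided", "random"):
    print(pattern, read_rate_mib_s("data.bin", 1024 * 1024, pattern))
```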

As expected, the comparison between read modes shows that access pattern and read size significantly affect the achieved performance. The results also show read performance profiles that are similar for the C, C NetCDF, and Python tests; netCDF4-python, however, performs less efficiently.
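
The kind of library comparison reported here can be reproduced at small scale by timing the same logical read through netCDF4-python and through a raw binary read. A sketch, assuming a 1-D float64 variable and hypothetical file names:

```python
import time

import numpy as np
from netCDF4 import Dataset

N = 10_000_000                                    # float64 values to read

t0 = time.time()
with Dataset("data.nc", "r") as nc:
    a = nc.variables["temperature"][:N]           # read via netCDF4-python
t_nc = time.time() - t0

t0 = time.time()
b = np.fromfile("data.bin", dtype="f8", count=N)  # raw sequential binary read
t_bin = time.time() - t0

mib = N * 8 / 1024 ** 2
print("netCDF4-python: %.1f MiB/s" % (mib / t_nc))
print("raw binary:     %.1f MiB/s" % (mib / t_bin))
```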

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Matthew Jones (1)
  • Jon Blower (1)
  • Bryan Lawrence (1, 2, 3)
  • Annette Osprey (1, 3)

  1. Department of Meteorology, University of Reading, Reading, UK
  2. STFC Rutherford Appleton Laboratory, Centre for Environmental Data Analysis, Didcot, UK
  3. National Centre for Atmospheric Science, Manchester, UK
