From Application to Disk: Tracing I/O Through the Big Data Stack

  • Robert Schmidtke
  • Florian Schintke
  • Thorsten Schütt
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11203)

Abstract

Typical applications in data science consume, process, and produce large amounts of data, making disk I/O one of the dominating factors of their overall performance, and thus one worth optimizing. Distributed processing frameworks, such as Hadoop, Flink, and Spark, hide a lot of complexity from the programmer when they parallelize these applications across a compute cluster. This complicates reasoning about I/O at every level: the application, the framework, the distributed file system (such as HDFS), and, finally, the local file systems.

We present SFS (Statistics File System), a modular framework to trace each I/O request issued by the application and any JVM-based big data framework involved, mapping these requests to actual disk I/O.

This allows detection of inefficient I/O patterns, by both the applications and the underlying frameworks, and forms the basis for improving I/O scheduling in the big data software stack.
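As a rough illustration of JVM-level I/O tracing, a stream wrapper can record per-stream statistics transparently before delegating to the underlying stream. The sketch below is hypothetical: class and method names are illustrative and do not reflect the actual SFS implementation, which instruments the stack rather than requiring applications to wrap their streams.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: an InputStream wrapper that records read
// statistics, in the spirit of tracing each I/O request a JVM
// application issues. Not the actual SFS API.
public class StatisticsInputStream extends FilterInputStream {
    private final AtomicLong bytesRead = new AtomicLong();
    private final AtomicLong readCalls = new AtomicLong();

    public StatisticsInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        readCalls.incrementAndGet();
        int b = super.read();
        if (b >= 0) {
            bytesRead.incrementAndGet();
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        readCalls.incrementAndGet();
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead.addAndGet(n);
        }
        return n;
    }

    public long getBytesRead() {
        return bytesRead.get();
    }

    public long getReadCalls() {
        return readCalls.get();
    }
}
```

Aggregating such counters across processes, and correlating them with counters collected at the distributed and local file system layers, is what enables mapping application-level requests to actual disk I/O.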

Keywords

Distributed file systems · I/O analysis · Big data ecosystems

Notes

Acknowledgments

This work received funding from the BMBF projects Berlin Big Data Center (BBDC) under grant 01IS14013B and GeoMultiSens under grant 01IS14010C.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Zuse Institute Berlin, Berlin, Germany
