Folding: Detailed Analysis with Coarse Sampling

  • Harald Servat
  • Germán Llort
  • Judit Giménez
  • Kevin Huck
  • Jesús Labarta
Conference paper


Performance analysis tools help application users find the bottlenecks that prevent their applications from running at full speed on current supercomputers. The level of detail and the accuracy of a performance tool are crucial to completely depict the nature of those bottlenecks. The detail exposed depends not only on the nature of the tool (profile-based or trace-based) but also on the mechanism it relies on to gather information (instrumentation or sampling). In this paper we present a mechanism called folding that combines instrumentation and sampling for trace-based performance analysis tools. The folding mechanism takes advantage of long execution runs and low-frequency sampling to finely detail the evolution of the user code with minimal overhead on the application. The reports provided by the folding mechanism are extremely useful for understanding the behavior of a region of code at a very low level. We also present a practical study carried out in an in-production scenario with the folding mechanism, and show that the folded results resemble those obtained with high-frequency sampling.
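To illustrate the core idea of folding described above, the following is a minimal sketch (not the authors' Extrae/Paraver implementation; the data layout and function name are hypothetical): sparse samples taken across many repeated instances of the same instrumented code region are mapped onto one normalized timeline, so that low-frequency sampling over a long run accumulates into a detailed profile of the region.

```python
# Hypothetical sketch of the folding idea: samples scattered across many
# instances of a region are "folded" onto a single normalized [0, 1) timeline.

def fold(instances, samples, bins=10):
    """instances: list of (start, end) times, one per execution of a region
       (from instrumentation). samples: list of (time, counter_value) sparse
       samples. Returns the per-bin average counter value over the
       normalized region timeline (None for empty bins)."""
    sums = [0.0] * bins
    counts = [0] * bins
    for t, v in samples:
        for start, end in instances:
            if start <= t < end:
                # Normalize the sample's time within its instance to [0, 1)
                pos = (t - start) / (end - start)
                b = min(int(pos * bins), bins - 1)
                sums[b] += v
                counts[b] += 1
                break
    return [sums[i] / counts[i] if counts[i] else None for i in range(bins)]

# Three instances of one region; a single sample lands in each instance at a
# different relative position, so together they populate the folded profile.
instances = [(0, 100), (100, 200), (200, 300)]
samples = [(10, 5.0), (150, 7.0), (290, 9.0)]
print(fold(instances, samples, bins=4))  # → [5.0, None, 7.0, 9.0]
```

In the paper's actual mechanism the folded samples are further fitted with interpolation to reconstruct the continuous evolution of performance counters inside the region; the sketch above only shows the normalization-and-accumulation step.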


Keywords: Computation Region · Performance Counter · High Frequency Sampling · Folding Mechanism · User Routine



This work was supported by the IBM/BSC MareIncognito project and by the Comisión Interministerial de Ciencia y Tecnología (CICYT) under Contract No. TIN2007-60625.



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Harald Servat (1)
  • Germán Llort (1)
  • Judit Giménez (1)
  • Kevin Huck (1)
  • Jesús Labarta (1)
  1. Barcelona Supercomputing Center, Universitat Politècnica de Catalunya, Barcelona, Catalunya, Spain
