ParaShares: Finding the Important Basic Blocks in Multithreaded Programs

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8632)

Abstract

Understanding and optimizing multithreaded execution is a significant challenge. Numerous research and industrial tools debug parallel performance by combing through program source or thread traces for pathologies including communication overheads, data dependencies, and load imbalances. This work takes a new approach: it ignores any underlying pathologies, and focuses instead on pinpointing the exact locations in source code that consume the largest share of execution. Our new metric, ParaShares, scores and ranks all basic blocks in a program based on their share of parallel execution. For the eight benchmarks examined in this paper, ParaShare rankings point to just a few important blocks per application. The paper demonstrates two uses of this information, exploring how the important blocks vary across thread counts and input sizes, and making modest source code changes (fewer than 10 lines of code) that result in 14–92% savings in parallel program runtime.
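The abstract does not spell out how a ParaShare is computed, but its premise, scoring blocks by their share of parallel execution, suggests weighting each dynamic block execution by the concurrency at which it ran. The C++ sketch below illustrates one such scheme under stated assumptions: `BlockProfile`, `weighted_work`, and `rank_blocks` are hypothetical names, the 1/p weighting is our assumption rather than the paper's formula, and the per-concurrency execution counts stand in for the kind of data a parallel block vector tool would collect.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-block profile. exec_counts[p] is how many times the block
// ran while p threads were concurrently active; insns is the number of
// instructions the block executes per invocation.
struct BlockProfile {
    std::string name;                              // e.g. "radix.c:317"
    std::size_t insns;                             // instructions per execution
    std::map<unsigned, std::uint64_t> exec_counts; // concurrency level -> count
};

// Weight each execution at concurrency p by 1/p: an instruction that runs
// alone accounts for more wall-clock time than one that runs alongside
// seven others. (The 1/p weighting is an assumption for illustration.)
double weighted_work(const BlockProfile& b) {
    double work = 0.0;
    for (const auto& [p, count] : b.exec_counts)
        work += static_cast<double>(count) * b.insns / p;
    return work;
}

// Score every block by its fraction of total weighted work, then sort
// descending so the most important blocks surface at the top of the ranking.
std::vector<std::pair<std::string, double>>
rank_blocks(const std::vector<BlockProfile>& blocks) {
    double total = 0.0;
    for (const auto& b : blocks) total += weighted_work(b);

    std::vector<std::pair<std::string, double>> shares;
    for (const auto& b : blocks)
        shares.emplace_back(b.name, weighted_work(b) / total);
    std::sort(shares.begin(), shares.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    return shares;
}

int main() {
    std::vector<BlockProfile> blocks = {
        // 4M raw instructions, all executed serially.
        {"lock_region", 8, {{1, 500000}}},
        // 20M raw instructions, all executed at 8-way parallelism.
        {"worker_loop", 20, {{8, 1000000}}},
    };
    // A raw instruction count would rank worker_loop first (20M vs. 4M),
    // but weighting by concurrency puts the serial lock_region on top.
    for (const auto& [name, share] : rank_blocks(blocks))
        std::cout << name << ": " << share << '\n';
}
```

With these made-up numbers, the serially executed lock_region outranks worker_loop despite executing a fifth as many raw instructions, which is exactly the kind of block a plain instruction-count profile would under-report.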

Keywords

Basic Block · Parallel Execution · Input Size · Multithreaded Program · Instruction Count

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

Columbia University, New York, USA
