
Potential thread-level-parallelism exploration with superblock reordering

Published in: Computing

Abstract

The growing number of processing cores in a single CPU demands more parallelism from sequential programs. Yet over the past decades little work has succeeded in automatically extracting enough parallelism, which casts a shadow over many-core architectures and automatic-parallelization research. Moreover, little work has tried to understand the nature, or the amount, of the parallelism potentially available in programs. In this paper we analyze at runtime the dynamic data dependencies among the superblocks of sequential programs. We design a meta re-arrange buffer to measure and exploit the available parallelism: superblocks are dynamically analyzed, reordered, and dispatched to run in parallel on an idealized many-core processor, while data dependencies and program correctness are preserved. In our experiments, superblock reordering yielded potential speedups ranging from 1.08 to 89.60. The results show that the potential parallelism of ordinary programs is still far from fully exploited by existing technologies, which makes automatic parallelization a promising research direction for many-core architectures.
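The idea of measuring dependence-limited parallelism can be illustrated with a small sketch. This is a hypothetical simplification, not the authors' actual tool: superblocks are modeled as (read-set, write-set) pairs in program order, a later block depends on an earlier one if their accesses conflict (read-after-write, write-after-read, or write-after-write), and the potential speedup is the trace length divided by the dependence-constrained critical path.

```python
# Hypothetical sketch (not the paper's meta re-arrange buffer itself):
# estimate the dependence-limited speedup of a superblock trace.

def potential_speedup(trace):
    """trace: list of (reads, writes) set pairs in program order.
    Returns (sequential_steps, parallel_steps, speedup)."""
    finish = []  # earliest parallel step at which each block can complete
    for i, (reads, writes) in enumerate(trace):
        earliest = 1
        for j in range(i):
            prev_reads, prev_writes = trace[j]
            # Any RAW, WAR, or WAW conflict forces block i after block j.
            if prev_writes & (reads | writes) or writes & prev_reads:
                earliest = max(earliest, finish[j] + 1)
        finish.append(earliest)
    seq = len(trace)
    par = max(finish, default=0)
    return seq, par, seq / par if par else 1.0

# Example: B1 and B2 both read what B0 writes; B3 is independent.
trace = [
    (set(), {"a"}),   # B0: writes a
    ({"a"}, {"b"}),   # B1: reads a, writes b
    ({"a"}, {"c"}),   # B2: reads a, writes c
    ({"x"}, {"y"}),   # B3: touches unrelated locations
]
print(potential_speedup(trace))  # → (4, 2, 2.0)
```

Here B1, B2, and B3 can all run in the step after B0, so four sequential steps collapse into two parallel ones. The paper's instrumentation-based analysis works at the level of real memory accesses rather than these toy location sets.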



Acknowledgments

This research is supported by the National Natural Science Foundation of China under Grant No. 61070001, the Natural Science Foundation of Zhejiang Province under Grant No. LQ12F02017, the Special Funds for Key Programs of China under Grants No. 2011ZX0302-004-002 and 2012ZX01031001-003, the Key Science Foundation of Zhejiang Province under Grant No. 2010C11048, the Open Fund of the Mobile Network Application Technology Key Laboratory of Zhejiang Province, and the Innovation Group of New Generation of Mobile Internet Software and Services of Zhejiang Province.

Author information


Corresponding author

Correspondence to John Ye.


About this article

Cite this article

Ye, J., Yan, H., Hou, H. et al. Potential thread-level-parallelism exploration with superblock reordering. Computing 96, 545–564 (2014). https://doi.org/10.1007/s00607-014-0387-8

