Efficient Loop Scheduling for Chip Multiprocessors with Non-Volatile Main Memory

Journal of Signal Processing Systems

Abstract

Non-volatile memories (NVMs) show great potential to replace DRAM as the main memory in many embedded systems because of attractive characteristics such as low cost, high density, and low energy consumption. However, the asymmetry between read and write costs must be addressed before the advantages of NVM can be fully exploited: a write operation on NVM is much more expensive than a read operation. Existing loop optimization techniques are not effective with non-volatile main memory because they do not take this asymmetry into account. In this paper, we propose an efficient loop scheduling algorithm, the Rotation with Maximum Bipartite Matching (RMBM) algorithm, to address the problem of expensive write operations on non-volatile main memory for chip multiprocessors (CMPs). It achieves high parallelism for a loop while reducing the number of write operations on NVM. Experimental results show that the RMBM algorithm reduces the number of write activities on NVM by 34.5% on average compared with the traditional rotation scheduling algorithm. Execution time is reduced by 20.5% and energy consumption by 15.03% on average. In other words, the average lifetime of NVM can be extended by more than two times using the proposed technique.
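The abstract names maximum bipartite matching as a building block of RMBM but does not say how the matching graph is constructed. As a minimal sketch of the standard technique only, and not the authors' algorithm, the following Python code computes a maximum matching with the classic augmenting-path method. The function name `maximum_bipartite_matching`, the task and core labels, and the reading of an edge as "this task's input data already resides on that core" are all assumptions made for illustration.

```python
from typing import Dict, Hashable, List, Set


def maximum_bipartite_matching(
    adj: Dict[Hashable, List[Hashable]]
) -> Dict[Hashable, Hashable]:
    """Return a maximum matching as {left_vertex: right_vertex}.

    adj maps each left vertex to the right vertices it may be paired with.
    Uses the classic augmenting-path algorithm.
    """
    match_right: Dict[Hashable, Hashable] = {}  # right vertex -> left vertex

    def try_augment(u: Hashable, visited: Set[Hashable]) -> bool:
        for v in adj.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            # Take v if it is free, or if its current partner can move elsewhere.
            if v not in match_right or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    for u in adj:
        try_augment(u, set())

    return {u: v for v, u in match_right.items()}


# Hypothetical input: an edge (task, core) means the task's input data is
# assumed to be on that core already, so placing it there avoids an NVM write.
candidates = {
    "t1": ["c1", "c2"],
    "t2": ["c1"],
    "t3": ["c2", "c3"],
}
matching = maximum_bipartite_matching(candidates)
print(matching)
```

In this toy input every task can be placed on a distinct compatible core, so the matching covers all three tasks. A scheduler in the spirit of the abstract would presumably recompute such a matching after each rotation step, but that coupling is not described in the text above.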



Acknowledgements

This work is partially supported by NSF CNS-1015802, Texas NHARP 009741-0020-2009, HK GRF 123609, NSFC 61173014, NSFC 61133005, NSFC 61173036, and the China Thousand-Talent Program.

Author information

Corresponding author

Correspondence to Qingfeng Zhuge.

About this article

Cite this article

Du, J., Wang, Y., Zhuge, Q. et al. Efficient Loop Scheduling for Chip Multiprocessors with Non-Volatile Main Memory. J Sign Process Syst 71, 261–273 (2013). https://doi.org/10.1007/s11265-012-0703-5

  • DOI: https://doi.org/10.1007/s11265-012-0703-5
