An Efficient Memory Organization for High-ILP Inner Modem Baseband SDR Processors

Journal of Signal Processing Systems

Abstract

This paper presents a memory organization for software-defined radio (SDR) inner modem baseband processors that focus on exploiting instruction-level parallelism (ILP). The organization uses power-efficient, single-ported, interleaved scratch-pad memory banks to provide enough bandwidth to high-ILP processors. A system of queues in the memory interface resolves bank conflicts among the single-ported banks and spreads long bursts of conflicting accesses to the same bank over time. Bank address rotation spreads long bursts of conflicting accesses over multiple banks. All proposed techniques have been implemented in hardware and are evaluated for a number of wireless communication standards. For the 11a|n (IEEE 802.11a/n) benchmarks, the overhead of stall cycles resulting from unresolved bank conflicts is reduced to below 2% with the proposed organization. For 3GPP-LTE, the most demanding wireless standard we evaluated, the overhead is reduced to less than 0.13%. This is achieved with little energy and area overhead, and without any bank-aware compiler support.
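The interplay of interleaving, per-bank queues, and bank address rotation can be illustrated with a toy model. The sketch below is not the hardware described in the paper: the bank count, the queue depth, the row-based rotation function, and all names (`bank_plain`, `bank_rotated`, `count_stalls`) are assumptions chosen for illustration, since the abstract does not specify them.

```python
from collections import deque

NUM_BANKS = 4     # assumed number of single-ported banks (illustrative)
QUEUE_DEPTH = 2   # assumed per-bank queue depth (illustrative)


def bank_plain(addr, num_banks=NUM_BANKS):
    """Low-order interleaving: consecutive words map to consecutive banks."""
    return addr % num_banks


def bank_rotated(addr, num_banks=NUM_BANKS):
    """Hypothetical rotation: shift the bank index by the row number, so
    addresses that differ by a multiple of num_banks land in different banks."""
    row = addr // num_banks
    return (addr + row) % num_banks


def _serve_one_cycle(queues):
    """Each single-ported bank completes at most one queued access per cycle."""
    for q in queues:
        if q:
            q.popleft()


def count_stalls(groups, bank_of, num_banks=NUM_BANKS, queue_depth=QUEUE_DEPTH):
    """Count stall cycles for a sequence of access groups.

    groups: list of address lists; one list is issued per processor cycle.
    An access whose target queue is full stalls the processor until the
    bank has drained one entry.
    """
    queues = [deque() for _ in range(num_banks)]
    stalls = 0
    for group in groups:
        for addr in group:
            q = queues[bank_of(addr, num_banks)]
            while len(q) >= queue_depth:
                _serve_one_cycle(queues)   # processor waits, banks keep draining
                stalls += 1
            q.append(addr)
        _serve_one_cycle(queues)           # normal cycle: every bank serves one access
    return stalls


# A stride equal to the bank count is the worst case for plain interleaving.
stride4 = [[0, 4, 8, 12], [16, 20, 24, 28]]
print("plain interleaving stalls:", count_stalls(stride4, bank_plain))
print("rotated interleaving stalls:", count_stalls(stride4, bank_rotated))
```

In this model the stride-4 pattern sends every access to bank 0 under plain interleaving, so the queue fills and the processor stalls, whereas the rotated mapping distributes the same accesses over all four banks and the queues never fill.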


Notes

  1. Finding valid schedules on our wide ADRES, which is in fact a coarse-grained reconfigurable array of 16 ALUs and 13 register files with a sparse interconnect, is very complex. To find those schedules, we currently rely on simulated annealing, which is quite slow. Making the compiler bank-aware might make it even slower. Moreover, as the evaluation section shows, we provide a solution based on rather cheap hardware that makes bank-aware code generation unnecessary altogether.


Author information

Corresponding author

Correspondence to Bjorn De Sutter.


About this article

Cite this article

De Sutter, B., Allam, O., Raghavan, P. et al. An Efficient Memory Organization for High-ILP Inner Modem Baseband SDR Processors. J Sign Process Syst 61, 157–179 (2010). https://doi.org/10.1007/s11265-009-0412-x

  • DOI: https://doi.org/10.1007/s11265-009-0412-x
