An Efficient Memory Organization for High-ILP Inner Modem Baseband SDR Processors

De Sutter, Bjorn; Allam, Osman; Raghavan, Praveen; Vandebriel, Roeland; Cappelle, Hans; Vander Aa, Tom; Mei, Bingfeng

doi:10.1007/s11265-009-0412-x

Published: 14 October 2009

Volume 61, pages 157–179, (2010)

Journal of Signal Processing Systems

Bjorn De Sutter¹, Osman Allam², Praveen Raghavan², Roeland Vandebriel², Hans Cappelle², Tom Vander Aa² & Bingfeng Mei²


Abstract

This paper presents a memory organization for SDR inner modem baseband processors that focus on exploiting ILP. This memory organization uses power-efficient, single-ported, interleaved scratch-pad memory banks to provide enough bandwidth to high-ILP processors. A system of queues in the memory interface is used to resolve bank conflicts among the single-ported banks, and to spread long bursts of conflicting accesses to the same bank over time. Bank address rotation is used to spread long bursts of conflicting accesses over multiple banks. All proposed techniques have been implemented in hardware, and are evaluated for a number of different wireless communication standards. For the 11a|n benchmarks, the overhead of stall cycles resulting from unresolved bank conflicts can be reduced to below 2% with the proposed organization. For 3GPP-LTE, the most demanding wireless standard we evaluated, the overhead is reduced to less than 0.13%. This is achieved with little energy and area overhead, and without any bank-aware compiler support.
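To make the idea of bank address rotation concrete, the following is a minimal sketch and not the mapping used in the paper, whose exact rotation function is not given in the abstract. It assumes a hypothetical memory of four word-interleaved banks and a simple per-row rotation (a common skewing scheme); the names `NUM_BANKS`, `bank_interleaved`, `bank_rotated`, and `bank_histogram` are illustrative only. It shows how a burst of accesses with a stride equal to the number of banks hits a single bank under plain interleaving, and is spread over all banks once the rotation is applied.

```python
from collections import Counter

NUM_BANKS = 4  # hypothetical bank count, chosen only for illustration


def bank_interleaved(word_addr: int) -> int:
    """Plain low-order interleaving: consecutive words go to consecutive banks."""
    return word_addr % NUM_BANKS


def bank_rotated(word_addr: int) -> int:
    """Interleaving with a per-row rotation (an assumed, simple skewing scheme).

    Each row of NUM_BANKS words is shifted by its row index, so words that
    are exactly NUM_BANKS apart no longer land in the same bank.
    """
    row = word_addr // NUM_BANKS
    return (word_addr + row) % NUM_BANKS


def bank_histogram(addresses, mapping):
    """Count how many of the given accesses fall into each bank."""
    return Counter(mapping(a) for a in addresses)


# A burst of 8 accesses with stride NUM_BANKS (e.g. walking a matrix column).
burst = [i * NUM_BANKS for i in range(8)]

plain = bank_histogram(burst, bank_interleaved)
rotated = bank_histogram(burst, bank_rotated)

print("plain interleaving:", dict(plain))
print("with rotation:     ", dict(rotated))
```

In this toy setting, the conflicts that remain after rotation (two accesses per bank) are the kind of short conflict that queues in the memory interface could absorb by delaying one of the accesses, rather than stalling the processor.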

Notes

Finding valid schedules on our wide ADRES, which is in fact a coarse-grained reconfigurable array of 16 ALUs and 13 register files with a sparse interconnect, is very complex. To find those schedules, we currently rely on simulated annealing, which is quite slow. Making the compiler bank-aware might make it even slower. Also, as we will see in the evaluation section, we provide a solution based on rather cheap hardware that makes bank-aware code generation unnecessary altogether.

Author information

Authors and Affiliations

Ghent University and Vrije Universiteit Brussel, Sint-Pietersnieuwstraat 41, 9000, Ghent, Belgium
Bjorn De Sutter
Interuniversity Micro-Electronics Center (IMEC), Kapeldreef 75, 3001, Heverlee, Belgium
Osman Allam, Praveen Raghavan, Roeland Vandebriel, Hans Cappelle, Tom Vander Aa & Bingfeng Mei

Corresponding author

Correspondence to Bjorn De Sutter.

About this article

Cite this article

De Sutter, B., Allam, O., Raghavan, P. et al. An Efficient Memory Organization for High-ILP Inner Modem Baseband SDR Processors. J Sign Process Syst 61, 157–179 (2010). https://doi.org/10.1007/s11265-009-0412-x

Received: 25 November 2008
Revised: 22 September 2009
Accepted: 24 September 2009
Published: 14 October 2009
Issue Date: November 2010
DOI: https://doi.org/10.1007/s11265-009-0412-x
