Abstract
This chapter introduces the concept of executing multiple incompatible loops in parallel, thereby enabling efficient multi-threading in a VLIW processor. The proposed multi-threading is enabled by a distributed instruction memory organization with minimal hardware overhead, and forms one of the core contributions of this book. The chapter shows how the proposed instruction memory hierarchy extension can both improve performance and reduce energy consumption compared to state-of-the-art simultaneous multi-threaded (SMT) architectures across various DSP benchmarks. It also shows that code can be compiled for the proposed architecture.
© 2010 Springer Science+Business Media B.V.
About this chapter
Cite this chapter
Catthoor, F., Raghavan, P., Lambrechts, A., Jayapala, M., Kritikakou, A., Absar, J. (2010). Multi-threading in Uni-threaded Processor. In: Ultra-Low Energy Domain-Specific Instruction-Set Processors. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-9528-2_6
DOI: https://doi.org/10.1007/978-90-481-9528-2_6
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-9527-5
Online ISBN: 978-90-481-9528-2
eBook Packages: Engineering, Engineering (R0)