Software Simultaneous Multi-Threading, a Technique to Exploit Task-Level Parallelism to Improve Instruction- and Data-Level Parallelism

  • Daniele Paolo Scarpazza
  • Praveen Raghavan
  • David Novo
  • Francky Catthoor
  • Diederik Verkest
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4148)


The search for energy efficiency in the design of embedded systems is leading toward CPUs with higher instruction-level and data-level parallelism. Unfortunately, individual applications do not have sufficient parallelism to keep all these CPU resources busy. Since embedded systems often run multiple tasks, task-level parallelism can be used to fill the unused resources. Simultaneous multi-threading (SMT) has proved a valuable technique for this purpose in high-performance systems, but it cannot be afforded in systems with tight energy budgets. Moreover, it does not exploit data-level parallel hardware, nor does it exploit the information available about the threads.

We propose software-SMT (SW-SMT), a technique to exploit task-level parallelism to improve the utilization of both instruction-level and data-level parallel hardware, thereby improving performance. The technique performs simultaneous compilation of multiple threads at design-time, and it includes a run-time selection of the most efficient mixes.
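The run-time selection step can be pictured with a small sketch. It is only an illustration under stated assumptions, not the selection policy of the paper, which this abstract does not describe. It assumes a hypothetical table of mixes produced at design time (the `Mix` record, its `energy` field, and the function `best_schedule` are all invented for this example) and exhaustively searches for the set of mixes that covers every ready thread exactly once at the lowest estimated total energy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mix:
    """Hypothetical record for one design-time compiled mix of threads."""
    threads: frozenset  # names of the threads merged into this mix
    energy: float       # assumed design-time energy estimate for the mix


def best_schedule(ready, mixes):
    """Return (total_energy, chosen_mixes) covering each ready thread once.

    Exhaustive search; suitable only for a handful of threads.
    Returns None when the available mixes cannot cover `ready`.
    """
    if not ready:
        return 0.0, []
    first = min(ready)  # fix one thread so each cover is tried only once
    best = None
    for mix in mixes:
        if first in mix.threads and mix.threads <= ready:
            rest = best_schedule(ready - mix.threads, mixes)
            if rest is None:
                continue
            total = mix.energy + rest[0]
            if best is None or total < best[0]:
                best = (total, [mix] + rest[1])
    return best


# Illustrative numbers only; they are not taken from the paper.
table = [
    Mix(frozenset({"A"}), 10.0),
    Mix(frozenset({"B"}), 8.0),
    Mix(frozenset({"C"}), 6.0),
    Mix(frozenset({"A", "B"}), 12.0),  # A and B compiled together
    Mix(frozenset({"B", "C"}), 11.0),  # B and C compiled together
]

total, chosen = best_schedule(frozenset({"A", "B", "C"}), table)
print(total, [sorted(m.threads) for m in chosen])
```

With these made-up estimates, running the mix of A and B followed by C alone costs 18, less than the 24 of running the three threads separately. A real system would more likely use a cheaper heuristic than exhaustive search at run time.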

We have applied the technique to two major blocks of an SDR (software-defined radio) application, achieving energy gains of up to 46% on different ILP and DLP architectures. We show that the potential of SW-SMT increases with SIMD datapath size and VLIW issue width.


Keywords (added by machine, not by the authors): Embedded System; Instruction Level Parallelism; VLIW Processor; Data Level Parallelism; MIMO Receiver





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Daniele Paolo Scarpazza (1, 2)
  • Praveen Raghavan (1, 3)
  • David Novo (1, 3)
  • Francky Catthoor (1, 3)
  • Diederik Verkest (1, 3, 4)

  1. IMEC vzw, Heverlee, Belgium
  2. Dipartimento di Elettronica e Informazione, Politecnico di Milano, Italy
  3. ESAT, K. U. Leuven, Heverlee, Belgium
  4. Electrical Engineering, Vrije Universiteit Brussels, Belgium
