International Journal of Parallel Programming, Volume 44, Issue 5, pp 1003–1027

Exploring the Design Space of an Energy-Efficient Accelerator for the SKA1-Low Central Signal Processor

  • Leandro Fiorin
  • Erik Vermij
  • Jan van Lunteren
  • Rik Jongerius
  • Christoph Hagleitner


The Square Kilometre Array (SKA) will be the biggest radio telescope ever built, with unprecedented sensitivity, angular resolution, and survey speed. Collectively, the SKA’s antennas are expected to gather exabytes of data per second and store one petabyte of data every day, requiring on the order of 10^18 operations per second (exa-scale computing) to process the data. This paper focuses on the SKA1-Low, the SKA’s aperture-array instrument consisting of 131,072 antennas that will be built in the first deployment phase of the project. In particular, our work explores the design of a custom architecture for the central signal processor (CSP) of the SKA1-Low. The CSP processes digitized samples sent by antennas that receive extraterrestrial radio-frequency signals between 50 and 350 MHz. We describe the challenges in building the CSP and present a quantitative study of a custom hardware architecture for executing the main CSP algorithms. By taking advantage of emerging 3D-stacked-memory devices and by exploring the design space for a 14-nm implementation, we estimate that our architecture consumes 9.62 W to process all channels of a sub-band and achieves an application-level energy efficiency of up to 312 GFLOPS/W.


Keywords: SKA, HPC, Custom computing architectures, Accelerators



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Leandro Fiorin (1)
  • Erik Vermij (1)
  • Jan van Lunteren (2)
  • Rik Jongerius (1)
  • Christoph Hagleitner (2)

  1. IBM Research, Dwingeloo, The Netherlands
  2. IBM Research – Zurich, Zurich, Switzerland
