Dennard, R., Gaensslen, F., Rideout, V., Bassous, E., LeBlanc, A.: Design of ion-implanted MOSFET’s with very small physical dimensions. IEEE J. Solid State Circuit 9(5), 256–268 (1974)
Article
Google Scholar
Moore, G.: Cramming more components onto integrated circuits. Electronics 38(8), 114–117 (1965)
Google Scholar
Duranton, M., et al.: The HiPEAC vision for advanced computing in horizon 2020. http://www.hipeac.net/roadmap (2013)
Lavignon J.F., et al.: ETP4HPC strategic research agenda achieving HPC leadership in Europe. (2013)
Lucas R., et al.: Top ten Exascale research challenges. http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/20140210/Top10reportFEB14 (2014)
Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., Burger, D.: Dark silicon and the end of multicore scaling. In: Proceedings 38th Annual International Symposium on Computer Architecture, ISCA’11, pp. 365–376 (2011)
Göddeke, D., Komatitsch, D., Geveler, M., Ribbrock, D., Rajovic, N., Puzovic, N., Ramirez, A.: Energy efficiency vs. performance of the numerical solution of PDEs: An application study on a low-power ARM-based cluster. J. Comput. Phys. 237(0), 132–150 (2013). doi:10.1016/j.jcp.2012.11.031
http://www.sciencedirect.com/science/article/pii/S0021999112007115
Rajovic, N., Carpenter, P.M., Gelado, I., Puzovic, N., Ramirez, A., Valero, M.: Supercomputing with commodity CPUs: Are mobile SoCs ready for HPC? In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13, pp. 40:1–40:12. ACM, New York, NY, USA (2013). doi:10.1145/2503210.2503281
The TOP500 list. http://www.top500.org (2015)
The GREEN500 list. http://www.green500.org (2015)
Hill, M., Marty, M.: Amdahl’s law in the multicore era. Computer 41(7), 33–38 (2008)
Article
Google Scholar
Kumar, R., Tullsen, D.M., Ranganathan, P., Jouppi, N.P., Farkas, K.I.: Single-ISA heterogeneous multi-core architectures for multithreaded workload performance. In: Proceedings 31st Annual International Symposium on Computer Architecture, ISCA’04, p. 64 (2004)
Morad, T., Weiser, U., Kolodny, A., Valero, M., Ayguade, E.: Performance, power efficiency and scalability of asymmetric cluster chip multiprocessors. Comput. Arch. Lett. 5(1), 14–17 (2006)
Winter, J.A., Albonesi, D.H., Shoemaker, C.A.: Scalable thread scheduling and global power management for heterogeneous many-core architectures. In: Proceeding 19th International Conference Parallel Architectures and Compilation Techniques, PACT’10, pp. 29–40 (2010)
Dongarra, J.J., Du Croz, J., Hammarling, S., Duff, I.: A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16(1), 1–17 (1990)
Article
MATH
Google Scholar
Kågström, B., Ling, P., van Loan, C.: GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark. ACM Trans. Math. Softw. 24(3), 268–302 (1998)
Article
MATH
Google Scholar
Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., Yelick, K.A.: The landscape of parallel computing research: a view from Berkeley. Technical report UCB/EECS-2006-183, University of California at Berkeley, Electrical Engineering and Computer Sciences (2006)
Intel Corp.: Intel math kernel library (MKL) 11.0. http://software.intel.com/en-us/intel-mkl (2015)
AMD: AMD core math library. http://developer.amd.com/tools/cpu/acml/pages/default.aspx (2015)
IBM: Engineering and scientific subroutine library. http://www.ibm.com/systems/software/essl/ (2015)
NVIDIA: CUDA basic linear algebra subprograms. https://developer.nvidia.com/cuBLAS (2015)
Goto, K., van de Geijn, R.: Anatomy of a high-performance matrix multiplication. ACM Trans. Math. Softw. 34(3), 12:1–12:25 (2008)
Goto, K., van de Geijn, R.: High performance implementation of the level-3 BLAS. ACM Trans. Math. Softw. 35(1), 4:1–4:14 (2008). doi:10.1145/1377603.1377607
OpenBLAS. http://xianyi.github.com/OpenBLAS/ (2015)
Van Zee, F.G., van de Geijn, R.A.: BLIS: A framework for generating BLAS-like libraries. ACM Trans. Math. Softw. (2016), (To appear)
Whaley, R.C., Dongarra, J.J.: Automatically tuned linear algebra software. In: Proceedings of SC’98 (1998)
Chitlur, N., Srinivasa, G., Hahn, S., Gupta, P., Reddy, D., Koufaty, D., Brett, P., Prabhakaran, A., Zhao, L., Ijih, N., Subhaschandra, S., Grover, S., Jiang, X., Iyer, R.: Quickia: Exploring heterogeneous architectures on real prototypes. In: High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on, pp. 1–8 (2012). doi:10.1109/HPCA.2012.6169046
Chitlur, N., Srinivasa, G., Hahn, S., Gupta, P.K., Reddy, D., Koufaty, D., Brett, P., Prabhakaran, A., Zhao, L., Ijih, N., Subhaschandra, S., Grover, S., Jiang, X., Iyer, R.: Quickia: Exploring heterogeneous architectures on real prototypes. In: Proceedings IEEE 18th International Symposium on High-Performance Computer Architecture, HPCA’12, pp. 1–8 (2012)
Hourd, J., Fan, C., Zeng, J., Zhang, Q.S., Best, M.J., Fedorova, A., Mustard, C.: Exploring practical benefits of asymmetric multicore processors. In: 2nd Workshop on Parallel Execution of Sequential Programs on Multi-core Architectures, PESPMA (2009)
Lakshminarayana, N.B., Lee, J., Kim, H.: Age based scheduling for asymmetric multiprocessors. In: Proceedings Conference on High Performance Computing Networking, Storage and Analysis, SC’09, pp. 25:1–25:12 (2009)
Rodrigues, R., Annamalai, A., Koren, I., Kundu, S.: Improving performance per watt of asymmetric multi-core processors via online program phase classification and adaptive core morphing. ACM Trans. Des. Autom. Electron. Syst. 18(1), 5:1–5:23 (2013)
Clarke, D., Lastovetsky, A., Rychkov, V.: Column-based matrix partitioning for parallel matrix multiplication on heterogeneous processors based on functional performance models. In: Euro-Par 2011: Parallel Processing Workshops, LNCS, vol. 7155, pp. 450–459 (2012)
Beaumont, O., Marchal, L.: Analysis of dynamic scheduling strategies for matrix multiplication on heterogeneous platforms. In: Proceedings 23rd International Symposium High-performance Parallel and Distributed Computing, HPDC’14, pp. 141–152 (2014)
Low, T.M., Igual, F.D., Smith, T.M., Quintana-Ortí, E.S.: Analytical modeling is enough for high performance BLIS. Technical report FLAWN #74, Department of Computer Sciences, The University of Texas at Austin ACM Trans. Math. Softw. (2014). http://www.cs.utexas.edu/users/flame/
Van Zee, F.G., Smith, T.M., Marker, B., Low, T.M., van de Geijn, R.A., Igual, F.D., Smelyanskiy, M., Zhang, X., Kistler, M., Austel, V., Gunnels, J., Killough, L.: The BLIS framework: Experiments in portability. ACM Trans. Math. Softw. (2014). In review. http://www.cs.utexas.edu/users/flame
Smith, T.M., van de Geijn, R., Smelyanskiy, M., Hammond, J.R., Van Zee, F.G.: Anatomy of high-performance many-threaded matrix multiplication. In: Proceedings IEEE 28th International Parallel and Distributed Processing Symposium, IPDPS’14, pp. 1049–1059 (2014)
Alonso, P., Badia, R.M., Labarta, J., Barreda, M., Dolz, M.F., Mayo, R., Quintana-Ortí, E.S., Reyes, R.: Tools for power-energy modelling and analysis of parallel scientific applications. In: 41st International Conference on Parallel Processing—ICPP, pp. 420–429 (2012)