
Vectorizable Design and Implementation of Matrix Multiplication on Vector Processor

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 553))

Abstract

Matrix-vector multiplication lies at the core of many algorithms in scientific computing, yet mapping it efficiently onto vector processors is a difficult problem. In this study, motivated by the back-propagation (BP) algorithm used in deep-learning applications, we analyze the BP algorithm in depth and, guided by the characteristics of the vector-processor architecture, propose an efficient vectorization method for matrix-vector multiplication. The L1D is configured in SRAM mode, and a double-buffered "ping-pong" scheme smooths data transfers through the multi-level memory hierarchy, so that kernel computation overlaps with DMA data movement and the kernel runs at peak speed, achieving the best computational efficiency. Transferring the matrix in transposed form via DMA avoids inefficient column-wise accesses and floating-point summation reductions across the vector processing elements (VPEs), yielding optimal kernel computing performance. Experimental results on MATRIX2 show that the presented double-precision matrix multiplication reaches 94.45 GFLOPS on a single core, with a kernel computation efficiency of 99.39%.



Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grants 61133007 and 61572025).

Author information


Corresponding author

Correspondence to Junyang Zhang.


Copyright information

© 2017 Springer Nature Singapore Pte Ltd.

Cite this paper

Zhang, J., Guo, Y., Hu, X. (2017). Vectorizable Design and Implementation of Matrix Multiplication on Vector Processor. In: Bhatia, S., Mishra, K., Tiwari, S., Singh, V. (eds) Advances in Computer and Computational Sciences. Advances in Intelligent Systems and Computing, vol 553. Springer, Singapore. https://doi.org/10.1007/978-981-10-3770-2_11

  • DOI: https://doi.org/10.1007/978-981-10-3770-2_11

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-3769-6

  • Online ISBN: 978-981-10-3770-2

  • eBook Packages: Engineering (R0)
