Algorithm Design for Tensor Units

  • Conference paper
Euro-Par 2021: Parallel Processing (Euro-Par 2021)

Abstract

To respond to the intense computational load of deep neural networks, a plethora of domain-specific architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these architectures is a hardware circuit for efficiently computing a dense matrix multiplication of a given small size. In order to broaden the class of algorithms that exploit these systems, we propose a computational model, named the TCU model, that captures the ability to natively multiply small matrices. We then use the TCU model for designing fast algorithms for several problems, including matrix operations (dense and sparse multiplication, Gaussian Elimination), graph algorithms (transitive closure, all pairs shortest distances), Discrete Fourier Transform, stencil computations, integer multiplication, and polynomial evaluation. We finally highlight a relation between the TCU model and the external memory model.
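
To make the primitive concrete, the following minimal sketch (not from the paper; tcu_mm and blocked_mm are hypothetical names, with numpy standing in for the hardware circuit) shows how a standard blocked matrix multiplication maps each tile product onto a native small-matrix multiplication, the unit operation that the TCU model charges for:

```python
import numpy as np

# Hypothetical stand-in for the hardware primitive: in the TCU model,
# the unit operation multiplies two fixed-size s x s tiles (e.g., the
# 16 x 16 tiles supported by NVIDIA Tensor Cores).
def tcu_mm(tile_a, tile_b):
    return tile_a @ tile_b

def blocked_mm(A, B, s=16):
    """Multiply two n x n matrices by decomposing them into s x s tiles
    and issuing one tcu_mm call per tile product. A TCU-model algorithm
    is charged per primitive invocation rather than per scalar flop.
    Assumes s divides n."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=np.result_type(A, B))
    for i in range(0, n, s):
        for j in range(0, n, s):
            for k in range(0, n, s):
                C[i:i+s, j:j+s] += tcu_mm(A[i:i+s, k:k+s], B[k:k+s, j:j+s])
    return C
```

With s fixed by the hardware, this naive schedule issues (n/s)^3 primitive calls; bounds of this kind, together with the data movement behind them, are what a TCU-model analysis is meant to capture.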

A preliminary draft appeared as brief announcement at SPAA 2020 [5]. This work was partially supported by NSF grant CNS-1553510, UniPD SID18 grant, PRIN17 20174LF3T8 AHeAd, UniBZ-CRC 2019-IN2091 Project, and INdAM-GNCS Project 2020 NoRMA. Some results are based upon work performed at the AlgoPARC Workshop on Parallel Algorithms and Data Structures, in part supported by NSF Grant CCF-1930579.

Notes

  1. We observe that \(\omega_0\) corresponds to \(\omega/2\), where \(\omega\) is the traditional symbol used for denoting the exponent in fast matrix multiplication algorithms.
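
    For intuition, this correspondence is a change of variables. A one-line sketch, assuming the standard convention that multiplying two \(n \times n\) matrices costs \(O(n^{\omega})\), with \(m = n^{2}\) the number of entries per matrix:

    ```latex
    % Substituting m = n^2 into the classical bound O(n^\omega):
    \[
      n^{\omega} = \bigl(m^{1/2}\bigr)^{\omega} = m^{\omega/2} = m^{\omega_0}.
    \]
    ```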

  2. With a slight abuse of notation, given two \(n\times n\) matrices \(A\) and \(B\) with \(n\) even, we define \((A \circledast B)[i,j] = \sum_{\alpha,\beta \in [-n/2,n/2)} A[(i+\alpha) \bmod n, (j+\beta) \bmod n]\, W[n/2-\alpha, n/2-\beta]\). In the paper, we omit the mod operation from the notation.
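
    A direct rendering of the definition may help parse the index arithmetic. The sketch below (circ_stencil is a hypothetical name, not from the paper) evaluates the sum literally, taking the weight matrix W that appears on the right-hand side as the second operand; since the formula indexes W from 1 to n, the code shifts by one for 0-based numpy arrays:

    ```python
    import numpy as np

    def circ_stencil(A, W):
        """Literal, unoptimized evaluation of footnote 2's operator
        (O(n^4) by design; a reference point, not the paper's algorithm).
        A and W are n x n with n even; accesses to A wrap modulo n, and
        W's 1-based indices from the formula are shifted to 0-based."""
        n = A.shape[0]
        assert n % 2 == 0 and W.shape == A.shape
        out = np.zeros((n, n), dtype=np.result_type(A, W))
        for i in range(n):
            for j in range(n):
                s = 0
                for alpha in range(-n // 2, n // 2):
                    for beta in range(-n // 2, n // 2):
                        s += (A[(i + alpha) % n, (j + beta) % n]
                              * W[n // 2 - alpha - 1, n // 2 - beta - 1])
                out[i, j] = s
        return out
    ```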

References

  1. Afshani, P., Sitchinava, N.: Sorting and permuting without bank conflicts on GPUs. In: Proceedings of the European Symposium on Algorithms (ESA), pp. 13–24 (2015)

  2. Ahle, T.D., Silvestri, F.: Similarity search with tensor core units. In: Proceedings of the 13th International Conference on Similarity Search and Application (SISAP), vol. 12440, pp. 76–84 (2020)

  3. Ahmad, Z., Chowdhury, R., Das, R., Ganapathi, P., Gregory, A., Zhu, Y.: Fast stencil computations using fast Fourier transforms. In: Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (2021)

  4. Ballard, G., Demmel, J., Holtz, O., Schwartz, O.: Graph expansion and communication costs of fast matrix multiplication. J. ACM 59(6), 32:1–32:23 (2013)

  5. Chowdhury, R., Silvestri, F., Vella, F.: Brief announcement: a computational model for tensor core units. In: Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (2020)

  6. Chowdhury, R.A., Silvestri, F., Vella, F.: A computational model for tensor core units. arXiv preprint arXiv:1908.06649 (2020)

  7. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. The MIT Press, Cambridge (2001)

  8. Dakkak, A., Li, C., Xiong, J., Gelado, I., Hwu, W.M.: Accelerating reduction and scan using tensor core units. In: Proceedings of the International Conference on Supercomputing (ICS), pp. 46–57 (2019)

  9. Firoz, J.S., Li, A., Li, J., Barker, K.: On the feasibility of using reduced-precision tensor core operations for graph analytics. In: 2020 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7 (2020)

  10. Jacob, R., Stöckel, M.: Fast output-sensitive matrix multiplication. In: Proceedings of the European Symposium on Algorithms (ESA), pp. 766–778 (2015)

  11. Jouppi, N.P., et al.: In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th International Symposium on Computer Architecture (ISCA), pp. 1–12 (2017)

  12. Karatsuba, A., Ofman, Y.: Multiplication of multidigit numbers on automata. Soviet Physics Doklady 7, 595 (1963)

  13. Karsin, B., Weichert, V., Casanova, H., Iacono, J., Sitchinava, N.: Analysis-driven engineering of comparison-based sorting algorithms on GPUs. In: Proceedings of the 32nd International Conference on Supercomputing (ICS), pp. 86–95 (2018)

  14. Kleinberg, J., Tardos, E.: Algorithm Design. Addison Wesley, Boston (2006)

  15. Lu, T., Chen, Y., Hechtman, B.A., Wang, T., Anderson, J.R.: Large-scale discrete Fourier transform on TPUs. arXiv preprint arXiv:2002.03260 (2020)

  16. NVIDIA Tesla V100 GPU architecture. http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

  17. Raz, R.: On the complexity of matrix product. SIAM J. Comput. 32(5), 1356–1369 (2003)

  18. Seidel, R.: On the all-pairs-shortest-path problem in unweighted undirected graphs. J. Comput. Syst. Sci. 51(3), 400–403 (1995)

  19. Sorna, A., Cheng, X., D’Azevedo, E., Won, K., Tomov, S.: Optimizing the fast Fourier transform using mixed precision on tensor core hardware. In: Proceedings of the 25th International Conference on High Performance Computing Workshops (HiPCW), pp. 3–7 (2018)

  20. Strassen, V.: Gaussian elimination is not optimal. Numer. Math. 13(4), 354–356 (1969). https://doi.org/10.1007/BF02165411

  21. Vitter, J.S.: Algorithms and data structures for external memory. Found. Trends Theor. Comput. Sci. 2(4), 305–474 (2006)

  22. Zachariadis, O., Satpute, N., Gómez-Luna, J., Olivares, J.: Accelerating sparse matrix–matrix multiplication with GPU tensor cores. Comput. Electr. Eng. 88, 106848 (2020)

Author information

Correspondence to Francesco Silvestri.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Chowdhury, R., Silvestri, F., Vella, F. (2021). Algorithm Design for Tensor Units. In: Sousa, L., Roma, N., Tomás, P. (eds) Euro-Par 2021: Parallel Processing. Euro-Par 2021. Lecture Notes in Computer Science, vol 12820. Springer, Cham. https://doi.org/10.1007/978-3-030-85665-6_22

  • DOI: https://doi.org/10.1007/978-3-030-85665-6_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-85664-9

  • Online ISBN: 978-3-030-85665-6

  • eBook Packages: Computer Science (R0)
