
MagmaDNN: Towards High-Performance Data Analytics and Machine Learning for Data-Driven Scientific Computing

  • Daniel Nichols
  • Nathalie-Sofia Tomov
  • Frank Betancourt
  • Stanimire Tomov
  • Kwai Wong
  • Jack Dongarra
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11887)

Abstract

In this paper, we present work towards the development of a new data analytics and machine learning (ML) framework, called MagmaDNN. Our main goal is to provide scalable, high-performance data analytics and ML solutions for scientific applications running on current and upcoming heterogeneous many-core GPU-accelerated architectures. To this end, since many of the required functionalities are based on standard linear algebra (LA) routines, we designed MagmaDNN to derive its performance from the MAGMA library. This close integration makes the fundamental, scalable, high-performance LA routines available in MAGMA the backend of MagmaDNN. We present design issues for performance and scalability that are specific to ML with Deep Neural Networks (DNNs), as well as the MagmaDNN designs for overcoming them. In particular, MagmaDNN uses well-established HPC techniques from the area of dense LA, including task-based parallelization, DAG representations, scheduling, mixed-precision algorithms, asynchronous solvers, and autotuned hyperparameter optimization. We illustrate these techniques and show how their incorporation and use in MagmaDNN allow it to outperform other currently available frameworks.
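To make the LA-backed design concrete, the sketch below shows how the forward pass of a fully connected layer reduces to a single GEMM plus a bias addition, which is the kind of kernel an LA-backed framework such as MagmaDNN would dispatch to MAGMA (or cuBLAS/cuDNN) on the GPU. This is a minimal, CPU-only illustration of the principle, not MagmaDNN's actual API; the dense_forward helper, the row-major layout, and the matrix sizes are assumptions made only for this example.

// Minimal sketch: a fully connected layer's forward pass, Y = W * X + b,
// written as the GEMM-shaped computation that a MAGMA-backed framework
// would offload to the GPU. CPU-only illustration; dense_forward and the
// row-major layout are assumptions for this example, not MagmaDNN's API.
#include <vector>
#include <cstdio>

// Y (out x batch) = W (out x in) * X (in x batch) + b (out), all row-major.
void dense_forward(int out, int in, int batch,
                   const std::vector<float>& W,
                   const std::vector<float>& X,
                   const std::vector<float>& b,
                   std::vector<float>& Y) {
    for (int i = 0; i < out; ++i)
        for (int j = 0; j < batch; ++j) {
            float acc = b[i];                      // start from the bias
            for (int k = 0; k < in; ++k)
                acc += W[i * in + k] * X[k * batch + j];
            Y[i * batch + j] = acc;                // one output activation
        }
    // In a MAGMA-backed implementation, the triple loop above would be a
    // single optimized GEMM call on device memory (e.g., an sgemm routine),
    // which is where such a framework derives its performance.
}

int main() {
    const int out = 2, in = 3, batch = 2;
    std::vector<float> W = {1, 2, 3,  4, 5, 6};    // 2 x 3 weights
    std::vector<float> X = {1, 0,  0, 1,  1, 1};   // 3 x 2 input batch
    std::vector<float> b = {0.5f, -0.5f};          // per-output bias
    std::vector<float> Y(out * batch);
    dense_forward(out, in, batch, W, X, b, Y);
    for (int i = 0; i < out; ++i)
        std::printf("%6.2f %6.2f\n", Y[i * batch], Y[i * batch + 1]);
    return 0;
}

The same structure is where the techniques listed in the abstract would plug in: batched and autotuned GEMM variants, mixed-precision (e.g., FP16/FP32) kernels, and task-based scheduling all replace or wrap this single LA call without changing the surrounding layer logic.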

Keywords

Machine learning · High-performance · DNN · Data-driven scientific computing

Acknowledgments

This work was conducted at the Joint Institute for Computational Sciences (JICS) and the Innovative Computing Laboratory (ICL). This work is sponsored by the National Science Foundation (NSF) through NSF REU Award #1659502, with additional support from the University of Tennessee, Knoxville (UTK), the National Institute for Computational Sciences (NICS), and NSF Awards #1740250 and #1709069. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by NSF grant #ACI-1548562. Computational resources were available through XSEDE education allocation awards TG-ASC170031 and TG-ASC190013. In addition, the computing work was also performed on technical workstations donated by the BP High Performance Computing Team, as well as on GPUs donated by NVIDIA.


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Daniel Nichols – 1
  • Nathalie-Sofia Tomov – 1
  • Frank Betancourt – 1
  • Stanimire Tomov – 1 (corresponding author)
  • Kwai Wong – 1
  • Jack Dongarra – 1, 2

  1. University of Tennessee, Knoxville, USA
  2. Oak Ridge National Laboratory, Oak Ridge, USA
