The Journal of Supercomputing, Volume 68, Issue 1, pp 65–86

A CUDA implementation of the Continuous Space Language Model

  • Elizabeth A. Thompson
  • Timothy R. Anderson


The training phase of the Continuous Space Language Model (CSLM) was implemented in Compute Unified Device Architecture (CUDA), the NVIDIA hardware/software architecture. A detailed explanation of the CSLM algorithm is provided. The implementation combines CUBLAS library routines, NVIDIA NPP functions, and CUDA kernel calls, and was run on three CUDA-enabled devices of varying compute capability; a time savings over the traditional CPU approach is demonstrated. The efficiency of the CUDA version of the open source implementation is analyzed and compared with that of a version using the Intel Math Kernel Library (MKL) on a variety of CUDA-enabled and multi-core CPU platforms. It is demonstrated that a substantial performance benefit can be obtained using CUDA, even with nonoptimal code. Techniques for optimizing performance are then provided. Furthermore, an analysis is performed to determine the conditions under which the performance of CUDA exceeds that of the multi-core MKL realization.
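As a rough illustration of the kind of computation that such an implementation hands to BLAS-style matrix routines, the following sketch runs one forward pass of a small feed-forward language model in plain Python. It assumes the usual CSLM layout of a projection layer, a tanh hidden layer, and a softmax output layer; the abstract itself does not spell out the layers. All function names, sizes, and weight values here are hypothetical, and the naive `matmul` merely stands in for the matrix product that a GPU version would delegate to a GEMM routine.

```python
import math


def matmul(a, b):
    """Naive dense matrix product; a stand-in for a BLAS level 3 GEMM call."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_bias(m, bias):
    """Add the same bias vector to every row of m."""
    return [[v + b for v, b in zip(row, bias)] for row in m]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def forward(context_batch, projection, w_hidden, b_hidden, w_out, b_out):
    """Return one probability distribution over the vocabulary per context."""
    # Projection layer: look up and concatenate the vectors of the context words.
    x = [[v for word in context for v in projection[word]] for context in context_batch]
    # Hidden layer: matrix product, bias, then element-wise tanh.
    hidden = [[math.tanh(v) for v in row] for row in add_bias(matmul(x, w_hidden), b_hidden)]
    # Output layer: matrix product, bias, then softmax per example.
    scores = add_bias(matmul(hidden, w_out), b_out)
    return [softmax(row) for row in scores]


# Hypothetical toy sizes: vocabulary of 3 words, 2-dimensional projections,
# 2 context words per example, 3 hidden units.
projection = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
w_hidden = [[0.5, -0.3, 0.1], [0.2, 0.4, -0.6], [-0.1, 0.3, 0.2], [0.7, -0.5, 0.0]]
b_hidden = [0.0, 0.1, -0.1]
w_out = [[0.3, -0.2, 0.5], [0.1, 0.4, -0.3], [-0.6, 0.2, 0.1]]
b_out = [0.0, 0.0, 0.0]

probs = forward([[0, 1], [2, 0]], projection, w_hidden, b_hidden, w_out, b_out)
```

Processing a whole batch of contexts at once is what turns the per-example matrix-vector products into matrix-matrix products, which is the regime in which GEMM routines tend to be most efficient.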


Keywords: CUDA, CSLM, GPU, Statistical signal processing, CUBLAS, Math Kernel Library, BLAS, High performance computing



Many thanks to Mike Pressler, IPFW Manager Electronics and Computer Support Services, for his outstanding technical support.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Purdue University Fort Wayne, Fort Wayne, USA
  2. Air Force Research Laboratory, 711th Human Performance Wing, Wright Patterson Air Force Base, Dayton, USA
