Abstract
The training phase of the Continuous Space Language Model (CSLM) was implemented in NVIDIA's Compute Unified Device Architecture (CUDA) hardware/software architecture. A detailed explanation of the CSLM algorithm is provided. The implementation combines CUBLAS library routines, NVIDIA Performance Primitives (NPP) functions, and CUDA kernel calls on three CUDA-enabled devices of varying compute capability, and a time savings over the traditional CPU approach is demonstrated. The efficiency of the CUDA version of the open-source implementation is analyzed and compared to that obtained using the Intel Math Kernel Library (MKL) on a variety of CUDA-enabled and multi-core CPU platforms. It is demonstrated that a substantial performance benefit can be obtained using CUDA, even with non-optimal code. Techniques for optimizing performance are then provided. Furthermore, an analysis is performed to determine the conditions under which the performance of CUDA exceeds that of the multi-core MKL realization.
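The CSLM (Schwenk 2007) is a feed-forward neural-network language model whose training cost is dominated by dense matrix products, which is why the work above maps those products onto CUBLAS SGEMM on the GPU (or MKL SGEMM on the CPU). As a rough illustration only, the sketch below shows the shape of one forward pass (hidden tanh layer followed by a softmax over the output vocabulary) in plain Python; the function names and sizes are illustrative assumptions, not code from the paper, and `matmul` stands in for the SGEMM call that the GPU accelerates.

```python
import math

def matmul(A, B):
    """Naive dense product; in the GPU implementation this is the
    CUBLAS SGEMM call that dominates training time."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def tanh_layer(X, W):
    """Hidden layer: H = tanh(X @ W), applied element-wise."""
    return [[math.tanh(v) for v in row] for row in matmul(X, W)]

def softmax(row):
    """Output-layer normalization over the vocabulary,
    shifted by the max for numerical stability."""
    mx = max(row)
    exps = [math.exp(v - mx) for v in row]
    s = sum(exps)
    return [e / s for e in exps]
```

In the batched training described in the paper, many n-gram contexts are processed at once, so `X` has one row per training example and the products become large enough to saturate the GPU.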
Acknowledgements
Many thanks to Mike Pressler, Manager of Electronics and Computer Support Services at IPFW, for his outstanding technical support.
Thompson, E.A., Anderson, T.R. A CUDA implementation of the Continuous Space Language Model. J Supercomput 68, 65–86 (2014). https://doi.org/10.1007/s11227-013-1023-7