
Divide-and-Conquer Parallelism for Learning Mixture Models

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 9940)

Abstract

From the viewpoint of load balancing among processors, the acceleration of machine-learning algorithms by using parallel loops is not realistic for some models involving hierarchical parameter estimation. There are also other serious issues such as memory access speed and race conditions. Some approaches to the race condition problem, such as mutual exclusion and atomic operations, degrade the memory access performance. Another issue is that the first-in-first-out (FIFO) scheduler supported by frameworks such as Hadoop can waste considerable time on queuing and this will also affect the learning speed. In this paper, we propose a recursive divide-and-conquer-based parallelization method for high-speed machine learning. Our approach exploits a tree structure for recursive tasks, which enables effective load balancing. Race conditions are also avoided, without slowing down the memory access, by separating the variables for summation. We have applied our approach to tasks that involve learning mixture models. Our experimental results show scalability superior to FIFO scheduling with an atomic-based solution to race conditions and robustness against load imbalance.


Notes

  1. http://www.mpi-forum.org.
  2. http://www.openmp.org.
  3. http://hadoop.apache.org.
  4. http://www.threadingbuildingblocks.org.
  5. http://computing.llnl.gov/tutorials/pthreads/.
  6. http://golang.org.
  7. http://www.cplusplus.com/reference/atomic/atomic.
  8. https://www.kernel.org.
  9. http://www.centos.org.
  10. http://gcc.gnu.org.

References

  1. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, June 2010
  2. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, April 2012
  3. Power, R., Li, J.: Piccolo: building fast, distributed programs with partitioned tables. In: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, October 2010
  4. Huang, C., Chen, Q., Wang, Z., Power, R., Ortiz, J., Li, J., Xiao, Z.: Spartan: a distributed array framework with smart tiling. In: Proceedings of the USENIX Annual Technical Conference, July 2015
  5. Dijkstra, E.W.: Cooperating sequential processes. EWD 123 (1968)
  6. Mohr, E., Kranz, D.A., Halstead Jr., R.H.: Lazy task creation: a technique for increasing the granularity of parallel programs. In: Proceedings of the 1990 ACM Conference on LISP and Functional Programming, May 1990
  7. Blumofe, R.D., Joerg, C.F., Kuszmaul, B.C., Leiserson, C.E., Randall, K.H., Zhou, Y.: Cilk: an efficient multithreaded runtime system. In: Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, August 1995
  8. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Ser. B (Methodol.) 39(1), 1–38 (1977)
  9. McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions. Wiley, Hoboken (2008)
  10. Kinoshita, A., Takasu, A., Adachi, J.: Traffic incident detection using probabilistic topic model. In: Proceedings of the Workshops of the EDBT/ICDT 2014 Joint Conference, March 2014
  11. Kinoshita, A., Takasu, A., Adachi, J.: Real-time traffic incident detection using a probabilistic topic model. Inf. Syst. 54(C), 169–188 (2015)
  12. Pereira, S.S., Lopez-Valcarce, R., Pages-Zamora, A.: A diffusion-based EM algorithm for distributed estimation in unreliable sensor networks. IEEE Signal Process. Lett. 20(6), 595–598 (2013)
  13. Chen, J., Salim, M.B., Matsumoto, M.: A Gaussian mixture model-based continuous boundary detection for 3D sensor networks. Sensors 10(8), 7632–7650 (2010)
  14. Miura, K., Noguchi, H., Kawaguchi, H., Yoshimoto, M.: A low memory bandwidth Gaussian mixture model (GMM) processor for 20,000-word real-time speech recognition FPGA system. In: 2008 International Conference on ICECE Technology, December 2008
  15. Gupta, K., Owens, J.D.: Three-layer optimizations for fast GMM computations on GPU-like parallel processors. In: IEEE Workshop on Automatic Speech Recognition & Understanding, December 2009
  16. Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time tracking. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 1999
  17. Li, H., Achim, A., Bull, D.R.: GMM-based efficient foreground detection with adaptive region update. In: Proceedings of the 16th IEEE International Conference on Image Processing, November 2009
  18. Patel, C.I., Patel, R.: Gaussian mixture model based moving object detection from video sequence. In: Proceedings of the International Conference and Workshop on Emerging Trends in Technology, February 2011
  19. Song, Y., Li, X., Liu, Q.: Fast moving object detection using improved Gaussian mixture models. In: International Conference on Audio, Language and Image Processing, July 2014
  20. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. In: Neurocomputing: Foundations of Research, January 1988
  21. Liu, Z., Li, H., Miao, G.: MapReduce-based backpropagation neural network over large scale mobile data. In: Sixth International Conference on Natural Computation, August 2010
  22. Gu, R., Shen, F., Huang, Y.: A parallel computing platform for training large scale neural networks. In: IEEE International Conference on Big Data, October 2013
  23. Hillis, W.D., Steele Jr., G.L.: Data parallel algorithms. Commun. ACM Spec. Issue Parallelism 29(12), 1170–1183 (1986)
  24. Flynn, M.J.: Some computer organizations and their effectiveness. IEEE Trans. Comput. C-21(9), 948–960 (1972)
  25. Kwedlo, W.: A parallel EM algorithm for Gaussian mixture models implemented on a NUMA system using OpenMP. In: 22nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), February 2014
  26. Yang, R., Xiong, T., Chen, T., Huang, Z., Feng, S.: DISTRIM: parallel GMM learning on multicore cluster. In: IEEE International Conference on Computer Science and Automation Engineering (CSAE), May 2012
  27. Wolfe, J., Haghighi, A., Klein, D.: Fully distributed EM for very large datasets. In: Proceedings of the 25th International Conference on Machine Learning, July 2008
  28. Kumar, N.S.L.P., Satoor, S., Buck, I.: Fast parallel expectation maximization for Gaussian mixture models on GPUs using CUDA. In: 11th IEEE International Conference on High Performance Computing and Communications, June 2009
  29. Machlica, L., Vanek, J., Zajic, Z.: Fast estimation of Gaussian mixture model parameters on GPU using CUDA. In: 12th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), October 2011
  30. Altinigneli, M.C., Plant, C., Böhm, C.: Massively parallel expectation maximization using graphics processing units. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2013
  31. Bergstrom, L., Reppy, J.: Nested data-parallelism on the GPU. In: Proceedings of the 17th ACM SIGPLAN International Conference on Functional Programming, September 2012
  32. Lee, H., Brown, K.J., Sujeeth, A.K., Rompf, T., Olukotun, K.: Locality-aware mapping of nested parallel patterns on GPU. In: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, December 2014
  33. Feeley, M.: A message passing implementation of lazy task creation. In: Halstead, R.H., Ito, T. (eds.) PSC 1992. LNCS, vol. 748, pp. 94–107. Springer, Heidelberg (1993). doi:10.1007/BFb0018649
  34. Umatani, S., Yasugi, M., Komiya, T., Yuasa, T.: Pursuing laziness for efficient implementation of modern multithreaded languages. In: Veidenbaum, A., Joe, K., Amano, H., Aiso, H. (eds.) ISHPC 2003. LNCS, vol. 2858, pp. 174–188. Springer, Heidelberg (2003). doi:10.1007/978-3-540-39707-6_13
  35. Acar, U.A., Chargueraud, A., Rainey, M.: Scheduling parallel programs by work stealing with private deques. In: Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2013
  36. Frigo, M., Leiserson, C.E., Randall, K.H.: The implementation of the Cilk-5 multithreaded language. In: Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, May 1998
  37. Min, S.J., Iancu, C., Yelick, K.: Hierarchical work stealing on manycore clusters. In: Fifth Conference on Partitioned Global Address Space Programming Models, October 2011
  38. Olivier, S.L., Porterfield, A.K., Wheeler, K.B., Prins, J.F.: Scheduling task parallelism on multi-socket multicore systems. In: Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers, May 2011
  39. Olivier, S.L., Porterfield, A.K., Wheeler, K.B., Spiegel, M., Prins, J.F.: OpenMP task scheduling strategies for multicore NUMA systems. Int. J. High Perform. Comput. Appl. 26(2), 110–124 (2012)
  40. Nakashima, J., Nakatani, S., Taura, K.: Design and implementation of a customizable work stealing scheduler. In: 3rd International Workshop on Runtime and Operating Systems for Supercomputers, June 2013
  41. Kranz, D.A., Halstead Jr., R.H., Mohr, E.: Mul-T: a high-performance parallel Lisp. In: Proceedings of the ACM SIGPLAN 1989 Conference on Programming Language Design and Implementation, June 1989
  42. Wheeler, K.B., Murphy, R.C., Thain, D.: Qthreads: an API for programming with millions of lightweight threads. In: IEEE International Symposium on Parallel and Distributed Processing, April 2008
  43. Molka, D., Hackenberg, D., Schöne, R., Müller, M.S.: Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system. In: 18th International Conference on Parallel Architectures and Compilation Techniques, September 2009
  44. Molka, D., Hackenberg, D., Schöne, R., Nagel, W.E.: Cache coherence protocol and memory performance of the Intel Haswell-EP architecture. In: 44th International Conference on Parallel Processing, September 2015
  45. Charles, P., Donawa, C., Ebcioglu, K., Grothoff, C., Kielstra, A., von Praun, C., Saraswat, V., Sarkar, V.: X10: an object-oriented approach to non-uniform cluster computing. In: Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, October 2005
  46. Callahan, D., Chamberlain, B.L., Zima, H.P.: The Cascade high productivity language. In: 9th International Workshop on High-Level Parallel Programming Models and Supportive Environments, April 2004
  47. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation, vol. 6, December 2004
  48. Furmento, N., Goglin, B.: Enabling high-performance memory migration for multithreaded applications on Linux. In: IEEE International Symposium on Parallel & Distributed Processing, May 2009
  49. Lameter, C.: NUMA (non-uniform memory access): an overview. Queue 11(7), 40 (2013)
  50. Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.: GraphLab: a new framework for parallel machine learning. In: Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, June 2010
  51. Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., Hellerstein, J.M.: Distributed GraphLab: a framework for machine learning and data mining in the cloud. In: Proceedings of the VLDB Endowment, April 2012
  52. Hamidouche, K., Falcou, J., Etiemble, D.: A framework for an automatic hybrid MPI+OpenMP code generation. In: Proceedings of the 19th High Performance Computing Symposia, April 2011
  53. Si, M., Pena, A.J., Balaji, P., Takagi, M., Ishikawa, Y.: MT-MPI: multithreaded MPI for many-core environments. In: Proceedings of the 28th ACM International Conference on Supercomputing, June 2014
  54. Luo, M., Lu, X., Hamidouche, K., Kandalla, K., Panda, D.K.: Initial study of multi-endpoint runtime for MPI+OpenMP hybrid programming model on multi-core systems. In: Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2014
  55. Kawakatsu, T., Kinoshita, A., Takasu, A., Adachi, J.: Highly efficient parallel framework: a divide-and-conquer approach. In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds.) DEXA 2015. LNCS, vol. 9262, pp. 162–176. Springer, Heidelberg (2015). doi:10.1007/978-3-319-22852-5_15
  56. Arora, N.S., Blumofe, R.D., Plaxton, C.G.: Thread scheduling for multiprogrammed multiprocessors. In: Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, June 1998
  57. Kirk, D.B., Hwu, W.W.: Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, San Francisco (2010)
  58. NVIDIA: CUDA C Programming Guide, version 6.5, August 2014


Acknowledgment

This work was supported by the CPS-IIP project (http://www.cps.nii.ac.jp) under the research promotion program for national challenges "Research and Development for the Realization of Next-Generation IT Platforms" of the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan. The experimental environment was made available by Assistant Prof. Hajime Imura at the Meme Media Laboratory, Hokkaido University, and by Yasuhiro Shirai at HP Japan Inc.

Author information

Corresponding author

Correspondence to Takaya Kawakatsu.


A General EM Algorithm

A.1 EM on GMM

The Gaussian mixture model (GMM) is a popular probabilistic model, described by a weighted sum of K normal distributions:

$$\begin{aligned} p(\varvec{x}) = \sum _{k=1}^K w_k \mathcal {N}(\varvec{x};\varvec{\mu }_k,S_k), \end{aligned}$$
(1)

where \(w_k\) is the weight, \(\varvec{\mu }_k\) is the mean, and \(S_k\) is the covariance matrix of the kth normal distribution. An observation item \(\varvec{x}\) is generated by a normal distribution selected with probability \(w_k\). We write \(\theta _k = (w_k, \varvec{\mu }_k, S_k)\) for brevity, and let \(\theta \) denote the set of all \(\theta _k\). The likelihood function \(\mathcal {L}(\theta )\) measures how likely it is that the probabilistic model regenerates the training dataset. Assuming independence among observation items, \(\mathcal {L}(\theta )\) equals the joint probability of all observation data. We work in log-likelihood terms because each \(p(\varvec{x}_n|\theta )\) is very small:

$$\begin{aligned} \mathcal {L}(\theta ) = \sum _n^N \log \sum _k^K w_k \mathcal {N}(\varvec{x}_n;\varvec{\mu }_k,S_k). \end{aligned}$$
(2)
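
The inner sum of Eq. (2) is typically evaluated with the log-sum-exp identity, \(\log \sum _k e^{a_k} = m + \log \sum _k e^{a_k - m}\) with \(m = \max _k a_k\), to avoid underflow. A minimal C++ sketch of this standard trick (the function name is our illustration, not from the chapter):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Stable log(sum_k exp(a[k])) for a non-empty vector a, where
// a[k] = log w_k + log N(x_n; mu_k, S_k) for one observation x_n.
double log_sum_exp(const std::vector<double>& a) {
    double m = *std::max_element(a.begin(), a.end()); // shift by the max
    double s = 0.0;
    for (double v : a) s += std::exp(v - m);          // every term <= 1
    return m + std::log(s);
}
```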

Fitting the model amounts to maximizing \(\mathcal {L}\). However, because the GMM is a latent-variable model, the maximum must be approached by iterative improvement, which is what the EM algorithm provides. The posterior probability \(q_{nk}\) that the nth observation item \(\varvec{x}_n\) was generated by the kth normal distribution is:

$$\begin{aligned} q_{nk} = \frac{w_k \mathcal {N}(\varvec{x}_n;\varvec{\mu }_k,S_k)}{\displaystyle \sum _{j=1}^K w_j \mathcal {N}(\varvec{x}_n;\varvec{\mu }_j,S_j)}. \end{aligned}$$
(3)

The algorithm alternates two steps: the E-step calculates \(q_{nk}\) for every pair of observation \(\varvec{x}_n\) and normal distribution k, and the M-step updates the parameters as follows:

$$\begin{aligned} \hat{w}_k&= \frac{1}{N} \sum _n^N q_{nk}, \end{aligned}$$
(4)
$$\begin{aligned} \hat{\varvec{\mu }}_k&= \frac{1}{N\hat{w}_k} \sum _n^N q_{nk} \varvec{x}_n, \end{aligned}$$
(5)
$$\begin{aligned} \hat{S}_k&= \frac{1}{N\hat{w}_k} \sum _n^N q_{nk} (\varvec{x}_n - \hat{\varvec{\mu }}_k)^T(\varvec{x}_n - \hat{\varvec{\mu }}_k). \end{aligned}$$
(6)

The E-step and M-step are repeated alternately until \(\mathcal {L}\) converges. In practice, the covariance matrix \(S_k\) is often assumed to be diagonal, which simplifies the update of its dth diagonal element to:

$$\begin{aligned} \hat{S}_{kd} = \frac{1}{N\hat{w}_k} \left( \sum _n^N q_{nk} x_{nd}^2\right) - \hat{\mu }_{kd}^2. \end{aligned}$$
(7)

The E-step calculates the \(N \times K\) posterior values \(q_{nk}\); the M-step sums them along the N axis and updates the parameters \(\theta _k\). However, when N is very large, the posterior table can become too large for main memory, or even for the hard disk; the poor throughput of out-of-core access then degrades the processing speed greatly. To avoid this condition, a parallel EM algorithm requires a large memory space.
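
To make the two steps concrete, the following single-threaded C++ sketch performs one EM iteration for a diagonal-covariance GMM, following Eqs. (2)–(5) and (7). The GMM struct, the row-major data layout, and all names are our own assumptions for illustration, not the authors' code:

```cpp
#include <cmath>
#include <vector>

struct GMM {
    int K, D;                 // components, dimensions
    std::vector<double> w;    // K mixture weights w_k
    std::vector<double> mu;   // K*D means, mu[k*D + d]
    std::vector<double> var;  // K*D diagonal variances S_{kd}
};

// log N(x; mu_k, diag(var_k)) for one D-dimensional point x.
static double log_normal(const GMM& g, int k, const double* x) {
    static const double log_2pi = std::log(8.0 * std::atan(1.0)); // log(2*pi)
    double s = -0.5 * g.D * log_2pi;
    for (int d = 0; d < g.D; ++d) {
        double diff = x[d] - g.mu[k * g.D + d];
        double v    = g.var[k * g.D + d];
        s          -= 0.5 * (std::log(v) + diff * diff / v);
    }
    return s;
}

// One E-step + M-step over N points stored row-major in X;
// returns the log-likelihood of Eq. (2).
double em_step(GMM& g, const std::vector<double>& X, int N) {
    const int K = g.K, D = g.D;
    // Sufficient statistics: sum q, sum q*x, sum q*x^2 (Eqs. 4, 5, 7).
    std::vector<double> sw(K, 0.0), sx(K * D, 0.0), sxx(K * D, 0.0);
    std::vector<double> logq(K);
    double ll = 0.0;
    for (int n = 0; n < N; ++n) {
        const double* x = &X[n * D];
        // E-step, Eq. (3): responsibilities computed in log space.
        double m = -1e300;                        // running maximum
        for (int k = 0; k < K; ++k) {
            logq[k] = std::log(g.w[k]) + log_normal(g, k, x);
            if (logq[k] > m) m = logq[k];
        }
        double z = 0.0;
        for (int k = 0; k < K; ++k) z += std::exp(logq[k] - m);
        ll += m + std::log(z);                    // one term of Eq. (2)
        for (int k = 0; k < K; ++k) {
            double q = std::exp(logq[k] - m) / z; // normalized q_{nk}
            sw[k] += q;
            for (int d = 0; d < D; ++d) {
                sx[k * D + d]  += q * x[d];
                sxx[k * D + d] += q * x[d] * x[d];
            }
        }
    }
    // M-step: Eqs. (4), (5), (7); note that N * w_hat_k = sw[k].
    for (int k = 0; k < K; ++k) {
        g.w[k] = sw[k] / N;
        for (int d = 0; d < D; ++d) {
            double mean = sx[k * D + d] / sw[k];
            g.mu[k * D + d]  = mean;
            g.var[k * D + d] = sxx[k * D + d] / sw[k] - mean * mean;
        }
    }
    return ll;
}
```

Calling em_step repeatedly until the returned log-likelihood stops improving implements the convergence loop described above. The accumulators sw, sx, and sxx are exactly the per-component sums of the M-step, and they are also the summation variables that a parallel version must keep private per task.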

A.2 EM on HPMM

Kinoshita et al. used a hierarchical Poisson mixture model (HPMM) to detect traffic incidents [10]. They assumed that probe-car records follow an HPMM in which each road segment has its own local parameters. In their model, the probability of a single record \(\varvec{x}\) in a segment s is described as follows:

$$\begin{aligned} p(\varvec{x}|s) = \sum _{k=1}^K w_{sk} \mathcal {P}(\varvec{x};\varvec{\mu }_k), \end{aligned}$$
(8)

where \(w_{sk}\) is the kth Poisson distribution’s weight in segment s, and \(\varvec{\mu }_k\) is the kth Poisson distribution’s mean. \(w_{sk}\) is particular to the segment, whereas \(\varvec{\mu }_k\) is common to all segments. The log-likelihood \(\mathcal {L}(\theta )\) is defined as follows:

$$\begin{aligned} \mathcal {L}(\theta ) = \sum _{s=1}^S \sum _{n=1}^{N_s} \log \sum _{k=1}^K w_{sk} \mathcal {P}(\varvec{x}_{sn};\varvec{\mu }_k), \end{aligned}$$
(9)

where \(N_s\) is the number of records in segment s. As with the GMM, each E-step must calculate the posterior probability \(q_{snk}\) that the nth record \(\varvec{x}_{sn}\) in segment s was generated by the kth Poisson distribution, for every triple (s, n, k):

$$\begin{aligned} q_{snk} = \frac{w_{sk} \mathcal {P}(\varvec{x}_{sn};\varvec{\mu }_k)}{\displaystyle \sum _{j=1}^K w_{sj} \mathcal {P}(\varvec{x}_{sn};\varvec{\mu }_j)}. \end{aligned}$$
(10)

In the M-step, the weight \(w_{sk}\) and mean \(\varvec{\mu }_k\) are recalculated:

$$\begin{aligned} \hat{w}_{sk}&= \frac{1}{N_s} \sum _{n=1}^{N_s} q_{snk}, \end{aligned}$$
(11)
$$\begin{aligned} \hat{\varvec{\mu }}_k&= \frac{\displaystyle \sum _{s=1}^S \sum _{n=1}^{N_s} q_{snk} \varvec{x}_{sn}}{\displaystyle \sum _{s=1}^S \sum _{n=1}^{N_s} q_{snk}}. \end{aligned}$$
(12)

Each road segment has a massive number of records, with the actual number varying greatly from segment to segment. This implies that we should take measures against load imbalance.
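
The sums over n and s in Eqs. (11) and (12) are precisely the kind of reduction that the divide-and-conquer approach parallelizes: the record range is split recursively into a task tree, every task accumulates into its own private summation variables, and partial sums are merged when tasks join, so no locks or atomics touch shared memory. A minimal C++ sketch of this pattern (std::async, the Sums struct, the grain size, and the assumption of scalar records are our illustrative choices, not the authors' framework):

```cpp
#include <future>
#include <vector>

// Per-task accumulators for the sums in Eqs. (11) and (12):
// sq[k] = sum_n q_{nk},  sqx[k] = sum_n q_{nk} * x_n.
struct Sums {
    std::vector<double> sq, sqx;
    explicit Sums(int K) : sq(K, 0.0), sqx(K, 0.0) {}
    void merge(const Sums& o) {
        for (size_t k = 0; k < sq.size(); ++k) {
            sq[k]  += o.sq[k];
            sqx[k] += o.sqx[k];
        }
    }
};

// Recursively split [lo, hi); each task sums into its own Sums object,
// so no synchronization is needed and merges happen up the tree.
Sums sum_tree(const double* q, const double* x, int lo, int hi, int K,
              int grain = 4096) {
    if (hi - lo <= grain) {                       // leaf: serial loop
        Sums s(K);
        for (int n = lo; n < hi; ++n)
            for (int k = 0; k < K; ++k) {
                s.sq[k]  += q[n * K + k];
                s.sqx[k] += q[n * K + k] * x[n];
            }
        return s;
    }
    int mid = lo + (hi - lo) / 2;
    auto right = std::async(std::launch::async, sum_tree, q, x,
                            mid, hi, K, grain);   // spawn right subtree
    Sums left = sum_tree(q, x, lo, mid, K, grain);
    left.merge(right.get());                      // race-free combine
    return left;
}
```

A production version would run such a task tree on a work-stealing scheduler with a bounded thread pool, as the Cilk-style runtimes cited in this chapter do; plain std::async is used here only to keep the sketch self-contained.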


Copyright information

© 2016 Springer-Verlag Berlin Heidelberg

Cite this chapter

Kawakatsu, T., Kinoshita, A., Takasu, A., Adachi, J. (2016). Divide-and-Conquer Parallelism for Learning Mixture Models. In: Hameurlain, A., Küng, J., Wagner, R., Chen, Q. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXVIII. Lecture Notes in Computer Science, vol. 9940. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-53455-7_2

  • DOI: https://doi.org/10.1007/978-3-662-53455-7_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-53454-0

  • Online ISBN: 978-3-662-53455-7