Machine Learning, Volume 108, Issue 8–9, pp 1369–1393

A flexible probabilistic framework for large-margin mixture of experts

  • Archit Sharma
  • Siddhartha Saxena
  • Piyush Rai
Part of the following topical collections:
  1. Special Issue of the ECML PKDD 2019 Journal Track


Mixture-of-Experts (MoE) models enable learning highly nonlinear models by combining simple expert models. Each expert handles a small region of the data space, as dictated by the gating network, which generates the (soft) assignment of each input to the experts. Despite their flexibility and renewed interest lately, existing MoE constructions pose several difficulties during model training. Crucially, neither of the two popular gating networks used in MoE, namely the softmax gating network and the hierarchical gating network (the latter used in the hierarchical mixture of experts), has an efficient inference algorithm. The problem is further exacerbated if the experts do not have a conjugate likelihood and lack a naturally probabilistic formulation (e.g., logistic regression or large-margin classifiers such as SVMs). To address these issues, we develop novel inference algorithms with closed-form parameter updates, leveraging recent advances in data augmentation techniques. We also present a novel probabilistic framework for MoE, consisting of a range of gating networks with efficient inference made possible through our proposed algorithms. We exploit this framework by using Bayesian linear SVMs, whose likelihood is otherwise non-conjugate, as experts on various classification problems, giving our final model attractive large-margin properties. We show that our inference algorithms are significantly more efficient than other training algorithms for MoE, while our models outperform traditional nonlinear models such as kernel SVMs and Gaussian processes on several benchmark datasets.
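To make the role of the gating network concrete, the following is a minimal sketch of the predictive rule of a softmax-gated mixture of linear experts for binary classification. It is an illustration only, not the inference algorithm proposed in the paper: the function names `softmax` and `moe_predict_proba`, the parameter names `gate_W` and `expert_W`, and the choice of logistic (rather than SVM) experts are all assumptions made for the example.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def moe_predict_proba(X, gate_W, expert_W):
    """Illustrative softmax-gated mixture of K linear experts.

    X        : (N, D) inputs
    gate_W   : (K, D) gating weights (hypothetical name)
    expert_W : (K, D) expert weights (hypothetical name)

    Returns P(y = +1 | x) for each input and the (N, K) soft assignments.
    Logistic experts are assumed here purely for simplicity.
    """
    gates = softmax(X @ gate_W.T)                        # soft assignment g_k(x)
    expert_p = 1.0 / (1.0 + np.exp(-(X @ expert_W.T)))   # each expert's P(y=+1 | x)
    # Mixture prediction: gate-weighted average of the expert predictions.
    return (gates * expert_p).sum(axis=1), gates


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 2))
    gate_W = rng.normal(size=(3, 2))
    expert_W = rng.normal(size=(3, 2))
    proba, gates = moe_predict_proba(X, gate_W, expert_W)
    print(gates.sum(axis=1))  # each row of soft assignments sums to 1
    print(proba)
```

Each row of `gates` is a distribution over experts, so an input lying deep inside one expert's region is predicted almost entirely by that expert, while inputs near region boundaries blend several experts.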


Keywords: Probabilistic modelling · Mixture of experts · Bayesian SVMs



Piyush Rai acknowledges support from Visvesvaraya Young Faculty Research Fellowship from MEITY India.



Copyright information

© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Google AI Resident, Google Brain, Mountain View, USA
  2. IIT Kanpur, Kanpur, India
