A flexible probabilistic framework for large-margin mixture of experts
Mixture-of-Experts (MoE) models enable learning highly nonlinear functions by combining simple expert models. Each expert handles a small region of the data space, as dictated by a gating network that generates a (soft) assignment of each input to the corresponding experts. Despite their flexibility and recent renewed interest, existing MoE constructions pose several difficulties during model training. Crucially, neither of the two popular gating networks used in MoE, namely the softmax gating network and the hierarchical gating network (the latter used in the hierarchical mixture of experts), admits an efficient inference algorithm. The problem is further exacerbated if the experts do not have a conjugate likelihood or lack a naturally probabilistic formulation (e.g., logistic regression, or large-margin classifiers such as SVMs). To address these issues, we develop novel inference algorithms with closed-form parameter updates, leveraging recent advances in data augmentation techniques. We also present a novel probabilistic framework for MoE, consisting of a range of gating networks for which our proposed algorithms make inference efficient. We exploit this framework by using Bayesian linear SVMs, whose likelihood is otherwise non-conjugate, as experts for various classification problems, endowing our final model with attractive large-margin properties. We show that our models can be trained significantly more efficiently than existing MoE training approaches, while outperforming traditional nonlinear models such as kernel SVMs and Gaussian processes on several benchmark datasets.
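To make the gating mechanism concrete, the following is a minimal sketch of a standard softmax-gated mixture of linear experts for binary classification, i.e., p(y|x) = sum_k g_k(x) p_k(y|x) with softmax gates g_k(x). It is not the paper's inference algorithm (which uses data augmentation and Bayesian SVM experts); the names and shapes (K experts, gating weights V, expert weights W) are illustrative assumptions.

```python
# Illustrative sketch (not the paper's method): softmax-gated mixture of
# linear experts for binary classification. V and W are assumed, fixed weights.
import numpy as np

def moe_predict(X, V, W):
    """X: (n, d) inputs; V: (K, d) gating weights; W: (K, d) expert weights.
    Returns P(y=1 | x) under p(y|x) = sum_k g_k(x) p_k(y|x)."""
    scores = X @ V.T                                    # (n, K) gating scores
    gates = np.exp(scores - scores.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)           # softmax gate g_k(x)
    expert_probs = 1.0 / (1.0 + np.exp(-(X @ W.T)))     # per-expert P(y=1 | x)
    return (gates * expert_probs).sum(axis=1)           # mixture prediction

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
V = rng.normal(size=(4, 3))    # K = 4 experts
W = rng.normal(size=(4, 3))
print(moe_predict(X, V, W))    # five probabilities in (0, 1)
```

Each input's prediction is a convex combination of the experts' predictions, so each expert only needs to model the region of the input space where its gate is large; the paper's contribution is making posterior inference over such gates and large-margin experts tractable.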
Keywords: Probabilistic modelling · Mixture of experts · Bayesian SVMs
Piyush Rai acknowledges support from the Visvesvaraya Young Faculty Research Fellowship from MEITY, India.