Abstract
We consider the problem of learning a federated model when the number of communication rounds is severely limited. We discuss recent works on model fusion, a special case of Federated Learning in which only a single communication round is allowed. A distinctive feature of this setting is that clients only need a pre-trained model, not the data it was trained on. Data storage regulations such as GDPR make this setting appealing, since each client can delete its data immediately after training its local model, before FL begins. However, model fusion methods are limited to relatively shallow neural network architectures. We therefore discuss extensions of model fusion to deep learning models that require more than one communication round but remain very efficient in terms of communication budget, i.e., the number of communication rounds and the size of the messages exchanged between the clients and the server. We consider both homogeneous and heterogeneous client data scenarios, including scenarios where training on the aggregated data is suboptimal due to biases in the data. In addition to deep learning methods, we cover unsupervised settings such as mixture models, topic models, and hidden Markov models.
We compare the statistical efficiency of model fusion to that of a hypothetical centralized approach in which a learner with unlimited compute and storage capacity simply aggregates data from all clients and trains a model in a non-federated way. As we shall see, although the model fusion approach generally matches the convergence rate of the (hypothetical) centralized approach, it may not have the same efficiency. Further, this discrepancy between the centralized and federated approaches is amplified when client data is heterogeneous.
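As a concrete illustration of the single-round communication pattern, the sketch below simulates a few clients that each train a local model, upload only its parameters, and then discard their data; the server fuses the uploads in one shot. It is a minimal sketch under simplifying assumptions: the least-squares "local model" and the coordinate-wise averaging fusion rule (and the helper names `local_training` and `fuse`) are illustrative stand-ins, not the fusion methods developed in this chapter, which replace naive averaging with more careful aggregation of the client models.

```python
import numpy as np

def local_training(X, y):
    """Stand-in for a client's pre-trained model: here, a least-squares fit.
    In practice each client already holds a trained model and can delete X, y."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def fuse(client_params):
    """Server-side fusion from a single communication round.
    Naive rule shown for illustration only: coordinate-wise averaging."""
    return np.mean(np.stack(client_params), axis=0)

rng = np.random.default_rng(0)
# Each client holds its own (X, y); only the fitted parameters are uploaded.
clients = [(rng.normal(size=(50, 10)), rng.normal(size=50)) for _ in range(5)]
uploads = [local_training(X, y) for X, y in clients]  # one message per client
global_model = fuse(uploads)                          # single communication round
print(global_model.shape)  # (10,)
```

Note that the communication cost here is one parameter vector per client, independent of the amount of local data, which is what makes the setting attractive under tight communication budgets.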
Notes
1. To keep things simple, we assume the set of all possible parameters Θ, called the parameter space, is an open subset of \({\mathbb {R}}^d\).
2. Its variance is O(1).