Abstract
We present a class of unsupervised statistical learning algorithms formulated in terms of minimizing Bregman divergences, a family of generalized entropy measures defined by convex functions. We obtain novel training algorithms that extract latent structure by minimizing a Bregman divergence on training data, subject to a set of non-linear constraints that involve the hidden variables. An alternating minimization procedure with nested iterative scaling is proposed to find feasible solutions to the resulting constrained optimization problem. The convergence of this algorithm and its information-geometric properties are characterized.
Index Terms — statistical machine learning, unsupervised learning, Bregman divergence, information geometry, alternating minimization, forward projection, backward projection, iterative scaling.
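As a rough illustration of the scheme described in the abstract, the sketch below specializes the Bregman divergence to the KL divergence (the divergence generated by negative entropy) and fits a small discrete exponential-family model with one hidden variable. The outer loop performs a backward projection (computing the posterior over hidden states and forming a target joint distribution), and the inner loop performs the forward projection via a few generalized iterative scaling updates. The toy setup, the variable names (feats, lam, p_emp), and the choice of generalized iterative scaling are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: alternating minimization with nested iterative scaling,
# specialized to the KL divergence on small discrete tables. Illustrative
# only -- not the algorithm from the paper.
import numpy as np

rng = np.random.default_rng(0)
n_x, n_z, n_feats = 5, 3, 4          # observed states, hidden states, features

# Nonnegative features over joint (x, z) states, normalized so that
# sum_k f_k(x, z) = 1 for every (x, z): the constant-sum condition that
# generalized iterative scaling (GIS) needs for its closed-form update.
feats = rng.random((n_feats, n_x, n_z))
feats /= feats.sum(axis=0, keepdims=True)

p_emp = rng.random(n_x)               # empirical distribution over observed x
p_emp /= p_emp.sum()

lam = np.zeros(n_feats)               # exponential-family parameters


def model_joint(lam):
    """Model p(x, z) proportional to exp(sum_k lam_k * f_k(x, z))."""
    unnorm = np.exp(np.tensordot(lam, feats, axes=1))   # shape (n_x, n_z)
    return unnorm / unnorm.sum()


for _ in range(200):
    p = model_joint(lam)
    # Backward projection (E-step analogue): posterior over hidden z given x,
    # combined with the empirical marginal to form a target joint q(x, z).
    post = p / p.sum(axis=1, keepdims=True)
    q = p_emp[:, None] * post
    # Target expected feature values under q, held fixed in the inner loop.
    target = (feats * q).sum(axis=(1, 2))
    # Forward projection (M-step analogue): nested iterative-scaling updates
    # moving the model's expected feature values toward the targets.
    for _ in range(5):
        model_feats = (feats * model_joint(lam)).sum(axis=(1, 2))
        lam += np.log(target / model_feats)

# The fitted marginal over x should have moved toward the empirical one
# (exact agreement is not guaranteed with this restricted feature set).
print("model marginal:", model_joint(lam).sum(axis=1))
print("empirical     :", p_emp)
```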
© 2003 Springer-Verlag Berlin Heidelberg
Cite this paper
Wang, S., Schuurmans, D. (2003). Learning Continuous Latent Variable Models with Bregman Divergences. In: Gavaldá, R., Jantke, K.P., Takimoto, E. (eds) Algorithmic Learning Theory. ALT 2003. Lecture Notes in Computer Science, vol. 2842. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39624-6_16
DOI: https://doi.org/10.1007/978-3-540-39624-6_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20291-2
Online ISBN: 978-3-540-39624-6