Abstract
This paper describes and applies a fully Bayesian approach to soft clustering and classification using mixed membership models. Our model structure has assumptions on four levels: population, subject, latent variable, and sampling scheme. Population level assumptions describe the general structure of the population that is common to all subjects. Subject level assumptions specify the distribution of observable responses given individual membership scores. Membership scores are usually unknown, so we can also view them as latent variables, treating them as either fixed or random in the model. Finally, the last level of assumptions specifies the number of distinct observed characteristics and the number of replications for each characteristic. We illustrate the flexibility and utility of the general model through two applications using data from (i) the National Long Term Care Survey, where we explore types of disability, and (ii) abstracts and bibliographies from articles published in The Proceedings of the National Academy of Sciences. In the first application we use a Markov chain Monte Carlo implementation for sampling from the posterior distribution. In the second application, because of the size and complexity of the database, we use a variational approximation to the posterior. We also include a guide to other applications of mixed membership modeling.
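The four levels of assumptions can be made concrete with a small simulation. The sketch below is illustrative only and is not the authors' implementation. It assumes discrete responses, Dirichlet-distributed membership scores (the "random" treatment of the latent variables), and a subject-level response distribution formed as a membership-weighted mixture of the population-level profiles. The function name `simulate_mixed_membership` and all parameter names and default values are hypothetical.

```python
import numpy as np


def simulate_mixed_membership(n_subjects, n_profiles, n_items, n_levels,
                              n_reps=1, alpha=0.5, seed=0):
    """Draw synthetic data from a simple mixed membership model (illustrative)."""
    rng = np.random.default_rng(seed)

    # Population level (assumed form): each of the K profiles has its own
    # distribution over response levels for every characteristic j.
    # Shape: (K, J, L).
    profiles = rng.dirichlet(np.ones(n_levels), size=(n_profiles, n_items))

    # Latent variable level (assumed random): membership scores for each
    # subject are nonnegative and sum to one. Shape: (N, K).
    memberships = rng.dirichlet(np.full(n_profiles, alpha), size=n_subjects)

    # Subject level (assumed form): given the membership scores, the response
    # distribution is a convex combination of the profile distributions.
    # Shape: (N, J, L).
    probs = np.einsum("nk,kjl->njl", memberships, profiles)

    # Sampling scheme: J distinct characteristics, each observed n_reps times.
    responses = np.empty((n_subjects, n_items, n_reps), dtype=int)
    for n in range(n_subjects):
        for j in range(n_items):
            responses[n, j] = rng.choice(n_levels, size=n_reps, p=probs[n, j])

    return profiles, memberships, responses


# Example: 5 subjects, 3 profiles, 4 binary characteristics, no replication.
profiles, memberships, responses = simulate_mixed_membership(5, 3, 4, 2)
print(memberships.round(2))
print(responses[:, :, 0])
```

Under these assumptions, a survey-style setting with many distinct binary items and no replication would correspond to `n_levels=2, n_reps=1`, while a text-style setting with few characteristics and many replications would use a large `n_reps`. Posterior inference (MCMC or a variational approximation) is not shown here.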
Keywords
- Latent Dirichlet Allocation
- Dirichlet Distribution
- Probabilistic Latent Semantic Analysis
- Soft Cluster
- Membership Score
© 2005 Springer-Verlag Berlin · Heidelberg
About this paper
Cite this paper
Erosheva, E.A., Fienberg, S.E. (2005). Bayesian Mixed Membership Models for Soft Clustering and Classification. In: Weihs, C., Gaul, W. (eds) Classification — the Ubiquitous Challenge. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-28084-7_2
DOI: https://doi.org/10.1007/3-540-28084-7_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25677-9
Online ISBN: 978-3-540-28084-2
eBook Packages: Mathematics and Statistics (R0)