Fast and efficient Bayesian semiparametric curve-fitting and clustering in massive data
 Sabyasachi Mukhopadhyay,
 Sisir Roy,
 Sourabh Bhattacharya
Abstract
The problem of curve-fitting and clustering using Bayesian mixture models, treating the number of components as unknown, has received wide attention in the Bayesian statistical community. Among a number of available Bayesian methodologies specialised for the purpose, the approaches proposed in Escobar and West (1995) and Richardson and Green (1997) stand out. But in the case of massive data substantial computational challenges seem to blur the attractive theoretical advantages of such pioneering Bayesian methodologies. Based on a methodology introduced by Bhattacharya (2008), which, as we show, includes the approach of Escobar and West (1995) as a special case, we propose a very fast and efficient curve-fitting and clustering methodology. Our clustering approach is based on a new approach to analysing nonparametric posterior distributions of clusterings first proposed in Mukhopadhyay, Bhattacharya and Dihidar (2011). Significant advantages of our approach over the aforementioned established mixture modeling approaches, particularly in the case of massive data, are demonstrated theoretically and with extensive simulation studies. We also illustrate our methodologies on a real, cosmological data set consisting of 96,307 bivariate observations and demonstrate that the approach of Escobar and West (1995) is infeasible in this example and the approach of Richardson and Green (1997), although implementable, is likely to be inefficient and computationally expensive.
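As a rough illustration of what "treating the number of components as unknown" can mean under a Dirichlet process prior (one of the keywords of this article), the sketch below simulates cluster labels from a Polya urn scheme, so that the number of clusters is a random outcome rather than a fixed input. It is a generic textbook construction, not the methodology of the article or of any cited paper; the function name `polya_urn_labels` and the parameters `n`, `alpha` and `seed` are assumptions made for this example.

```python
import random


def polya_urn_labels(n, alpha, seed=0):
    """Simulate cluster labels for n items from a Polya urn scheme.

    Hypothetical helper for illustration only: alpha > 0 is the
    concentration parameter, seed fixes the random stream.
    """
    rng = random.Random(seed)
    labels = []   # labels[i] = cluster index of item i
    counts = []   # counts[k] = number of items currently in cluster k
    for i in range(n):
        # Item i opens a new cluster with probability alpha / (alpha + i);
        # otherwise it joins an existing cluster with probability
        # proportional to that cluster's current size.
        u = rng.random() * (alpha + i)
        if u < alpha:
            counts.append(1)
            labels.append(len(counts) - 1)
        else:
            u -= alpha
            k = 0
            while k < len(counts) - 1 and u >= counts[k]:
                u -= counts[k]
                k += 1
            counts[k] += 1
            labels.append(k)
    return labels, counts


if __name__ == "__main__":
    labels, counts = polya_urn_labels(n=1000, alpha=1.0)
    print("number of clusters:", len(counts))
    print("largest cluster size:", max(counts))
```

Each item requires a pass over the current clusters, which hints at why schemes that update one observation at a time can become costly when the number of observations is very large.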
 Abramowitz, M, Stegun, IA Stirling Numbers of the Second Kind. In: Abramowitz, M, Stegun, IA eds. (1972) Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, New York, pp. 824-825
 Antoniak, CE (1974) Mixtures of Dirichlet processes with applications to nonparametric problems. Ann. Statist. 2: pp. 1152-1174
 Bell, ET (1934) Exponential numbers. Amer. Math. Monthly 41: pp. 411-419
 Bhattacharya, S (2008) Gibbs sampling based Bayesian analysis of mixtures with unknown number of components. Sankhya. Series B 70: pp. 133-155
 Bush, CA, MacEachern, SN (1996) A semiparametric Bayesian model for randomised block designs. Biometrika 83: pp. 275-285
 Carlin, BP, Gelfand, AE, Smith, AFM (1992) Hierarchical Bayesian analysis of changepoint problems. Appl. Stat. 41: pp. 389-405
 Dahl, D (2009) Modal clustering in a class of product partition models. Bayesian Anal. 4: pp. 243-264
 Dalal, SR, Hall, WJ (1983) Approximating priors by mixtures of natural conjugate priors. J. R. Stat. Soc. Ser. B. 45: pp. 278-286
 Diaconis, P, Ylvisaker, D Quantifying Prior Opinion (with discussion). In: Bernardo, JM, DeGroot, MH, Lindley, DV, Smith, AFM eds. (1985) Bayesian Statistics 2. North-Holland, Amsterdam, pp. 133-156
 Escobar, MD, West, M (1995) Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc. 90: pp. 577-588
 Ferguson, TS (1973) A Bayesian analysis of some nonparametric problems. Ann. Statist. 1: pp. 209-230
 Ghosh, JK, Dihidar, K, Samanta, T On Different Clusterings of the Same Data Set. In: Arnold, B, Gather, U, Bendre, SM eds. (2009) Felicitation Volume in Honour of Prof. B. K. Kale. MacMillan, New Delhi
 Jain, S, Neal, RM (2004) A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. J. Comput. Graph. Statist. 13: pp. 158-182
 Jain, S, Neal, RM (2007) Splitting and merging components of a nonconjugate Dirichlet process mixture model (with discussion). Bayesian Anal. 2: pp. 445-472
 Jensen, ST, Liu, JS (2008) Bayesian clustering of transcription factor binding motifs. J. Amer. Statist. Assoc. 103: pp. 188-200
 Lee, K, Marin, JM, Mengersen, K, Robert, CP (2008) Bayesian inference on mixtures of distributions.
 MacEachern, SN (1994) Estimating normal means with a conjugate-style Dirichlet process prior. Comm. Statist. Simulation Comput. 23: pp. 727-741
 McCullagh, P, Yang, J (2008) How many clusters? Bayesian Anal. 3: pp. 101-120
 McLachlan, GJ, Basford, KE (1988) Mixture Models: Inference and Applications to Clustering. Dekker, New York
 Mukhopadhyay, S, Bhattacharya, S (2012) Perfect simulation for mixtures with known and unknown number of components. Bayesian Anal. 7: pp. 675-714
 Mukhopadhyay, S, Bhattacharya, S, Dihidar, K (2011) On Bayesian central clustering: application to landscape classification of Western Ghats. Ann. Appl. Stat. 5: pp. 1948-1977
 Müller, P, Erkanli, A, West, M (1996) Bayesian curve fitting using multivariate normal mixtures. Biometrika 83: pp. 67-79
 Quintana, FA, Iglesias, PL (2003) Bayesian clustering and product partition models. J. R. Stat. Soc. Ser. B. 65: pp. 557-574
 Richardson, S, Green, PJ (1997) On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. R. Stat. Soc. Ser. B. 59: pp. 731-792
 Titterington, DM, Smith, AFM, Makov, UE (1985) Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons, New York
 Wang, L, Dunson, D (2011) Fast Bayesian inference in Dirichlet process mixture models. J. Comput. Graph. Statist. 20: pp. 196-216
 Journal
 Sankhya B, Volume 74, Issue 1, pp 77-106
 Cover Date
 2012-05-01
 DOI
 10.1007/s13571-012-0044-1
 Print ISSN
 0976-8386
 Online ISSN
 0976-8394
 Publisher
 Springer-Verlag
 Keywords

 Cluster analysis
 Cosmology
 Dirichlet process
 Model validation
 Markov chain Monte Carlo
 Nonlinear regression
 Reversible jump Markov chain Monte Carlo
 Primary 62G08
 Secondary 91C20
 Authors

 Sabyasachi Mukhopadhyay ^{(1)}
 Sisir Roy ^{(2)}
 Sourabh Bhattacharya ^{(1)}
 Author Affiliations

 1. Bayesian and Interdisciplinary Research Unit, Indian Statistical Institute, 203, B. T. Road, Kolkata, 700 108, India
 2. Physics and Applied Mathematics Unit, Indian Statistical Institute, 203, B. T. Road, Kolkata, 700 108, India