Fast and efficient Bayesian semi-parametric curve-fitting and clustering in massive data
The problem of curve-fitting and clustering using Bayesian mixture models, treating the number of components as unknown, has received wide attention in the Bayesian statistical community. Among the Bayesian methodologies specialised for this purpose, the approaches proposed in Escobar and West (1995) and Richardson and Green (1997) stand out. For massive data, however, substantial computational challenges tend to overshadow the attractive theoretical advantages of these pioneering methodologies. Based on a methodology introduced by Bhattacharya (2008), which, as we show, includes the approach of Escobar and West (1995) as a special case, we propose a very fast and efficient curve-fitting and clustering methodology. Our clustering method builds on a new approach to analysing non-parametric posterior distributions of clusterings, first proposed in Mukhopadhyay, Bhattacharya and Dihidar (2011). Significant advantages of our approach over the aforementioned established mixture modelling approaches, particularly in the case of massive data, are demonstrated theoretically and with extensive simulation studies. We also illustrate our methodologies on a real cosmological data set consisting of 96,307 bivariate observations, and demonstrate that the approach of Escobar and West (1995) is infeasible in this example, while the approach of Richardson and Green (1997), although implementable, is likely to be inefficient and computationally expensive.
- Abramowitz, M. and Stegun, I.A. (1972). Stirling Numbers of the Second Kind. In Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables (M. Abramowitz and I.A. Stegun, eds.). Dover, New York, pp. 824–825.
- Antoniak, C.E. (1974). Mixtures of Dirichlet processes with applications to nonparametric problems. Ann. Statist., 2, 1152–1174.
- Bell, E.T. (1934). Exponential numbers. Amer. Math. Monthly, 41, 411–419.
- Bhattacharya, S. (2008). Gibbs sampling based Bayesian analysis of mixtures with unknown number of components. Sankhya. Series B, 70, 133–155.
- Bush, C.A. and MacEachern, S.N. (1996). A semiparametric Bayesian model for randomised block designs. Biometrika, 83, 275–285.
- Carlin, B.P., Gelfand, A.E. and Smith, A.F.M. (1992). Hierarchical Bayesian analysis of changepoint problems. Appl. Stat., 41, 389–405.
- Dahl, D. (2009). Modal clustering in a class of product partition models. Bayesian Anal., 4, 243–264.
- Dalal, S.R. and Hall, W.J. (1983). Approximating priors by mixtures of natural conjugate priors. J. R. Stat. Soc. Ser. B., 45, 278–286.
- Diaconis, P. and Ylvisaker, D. (1985). Quantifying Prior Opinion (with discussion). In Bayesian Statistics 2 (J.-M. Bernardo, M.H. DeGroot, D.V. Lindley and A.F.M. Smith, eds.). North-Holland, Amsterdam, pp. 133–156.
- Escobar, M.D. and West, M. (1995). Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc., 90, 577–588.
- Ferguson, T.S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist., 1, 209–230.
- Ghosh, J.K., Dihidar, K. and Samanta, T. (2009). On Different Clusterings of the Same Data Set. In Felicitation Volume in Honour of Prof. B. K. Kale (B. Arnold, U. Gather and S.M. Bendre, eds.). MacMillan, New Delhi.
- Jain, S. and Neal, R.M. (2004). A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. J. Comput. Graph. Statist., 13, 158–182.
- Jain, S. and Neal, R.M. (2007). Splitting and merging components of a nonconjugate Dirichlet process mixture model (with discussion). Bayesian Anal., 2, 445–472.
- Jensen, S.T. and Liu, J.S. (2008). Bayesian clustering of transcription factor binding motifs. J. Amer. Statist. Assoc., 103, 188–200.
- Lee, K., Marin, J.-M., Mengersen, K. and Robert, C.P. (2008). Bayesian inference on mixtures of distributions.
- MacEachern, S.N. (1994). Estimating normal means with a conjugate-style Dirichlet process prior. Comm. Statist. Simulation Comput., 23, 727–741.
- McCullagh, P. and Yang, J. (2008). How many clusters? Bayesian Anal., 3, 101–120.
- McLachlan, G.J. and Basford, K.E. (1988). Mixture Models: Inference and Applications to Clustering. Dekker, New York.
- Mukhopadhyay, S. and Bhattacharya, S. (2012). Perfect simulation for mixtures with known and unknown number of components. Bayesian Anal., 7, 675–714.
- Mukhopadhyay, S., Bhattacharya, S. and Dihidar, K. (2011). On Bayesian central clustering: application to landscape classification of Western Ghats. Ann. Appl. Stat., 5, 1948–1977.
- Müller, P., Erkanli, A. and West, M. (1996). Bayesian curve fitting using multivariate normal mixtures. Biometrika, 83, 67–79.
- Quintana, F.A. and Iglesias, P.L. (2003). Bayesian clustering and product partition models. J. R. Stat. Soc. Ser. B., 65, 557–574.
- Richardson, S. and Green, P.J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. R. Stat. Soc. Ser. B., 59, 731–792.
- Titterington, D.M., Smith, A.F.M. and Makov, U.E. (1985). Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons, New York.
- Wang, L. and Dunson, D. (2011). Fast Bayesian inference in Dirichlet process mixture models. J. Comput. Graph. Statist., 20, 196–216.
Volume 74, Issue 1, pp 77–106
- Cluster analysis
- Dirichlet process
- Model validation
- Markov chain Monte Carlo
- Non-linear regression
- Reversible jump Markov chain Monte Carlo
- Primary 62G08
- Secondary 91C20
- Author Affiliations
- 1. Bayesian and Interdisciplinary Research Unit, Indian Statistical Institute, 203, B. T. Road, Kolkata, 700 108, India
- 2. Physics and Applied Mathematics Unit, Indian Statistical Institute, 203, B. T. Road, Kolkata, 700 108, India